Skip to content

fix(build): apt bins to app stage, per-binary COPYs, no more lost runtime deps (closes #149)#150

Merged
initializ-mk merged 3 commits into
mainfrom
fix/issue-149-bins-runtime-deps
Jun 11, 2026
Merged

fix(build): apt bins to app stage, per-binary COPYs, no more lost runtime deps (closes #149)#150
initializ-mk merged 3 commits into
mainfrom
fix/issue-149-bins-runtime-deps

Conversation

@initializ-mk

Copy link
Copy Markdown
Contributor

Closes #149. Stacked on top of #148.

Summary

apt-installed binaries declared by skills were silently lost in the runtime image. The bundled `code-review` skill needs `curl`, `git`, `jq`, `gh` — only `gh` survived. Any skill script calling `curl`/`git`/`jq` would log "command not found" at runtime.

Also addressed: the wholesale-directory `COPY --from=bins /usr/local/bin/ /usr/local/bin/` that masked the bug — replaced with explicit per-binary `COPY` lines for every binary that reaches the app stage.

Root cause

The pre-fix flow put every skill bin through a single shared `bins` stage:

```
FROM debian:bookworm-slim AS bins
RUN apt-get install -y curl git jq # ← apt lands these at /usr/bin/
RUN curl ... | tar xz ... && mv .../gh /usr/local/bin/gh

--- Application stage ---

COPY --from=bins /usr/local/bin/ /usr/local/bin/ # ← only catches gh
```

apt drops binaries at `/usr/bin/` per Debian convention, with transitive deps in `/usr/lib/`, `/etc/`. Copying just `/usr/local/bin/` from the bins stage left them behind. And even a per-binary `COPY --from=bins /usr/bin/git` wouldn't work because git depends on `/usr/lib/git-core/` and other support files.

Fix

Restructured `packaging.GenerateDockerfile` to return a structured `DockerfileFragments` separating concerns by install method:

Install method Where it lives now
apt / apk Runtime install in the application stage — apt's dep resolution handles libs + config.
direct-URL (kubectl, etc.) Bins stage download + per-binary `COPY --from=bins /path /path` in app stage.
custom-RUN (gh tarball etc.) Bins stage + per-binary `COPY --from=bins`.
image-copy (playwright etc.) Per-binary `COPY --from=bin-` from the dedicated companion stage — no bins-stage intermediate hop.
local-file Per-binary `COPY .local-bins/` in app stage — no bins stage needed.

`DockerfileFragments` exposes:

```go
type DockerfileFragments struct {
PreAppStages string // companion image-copy stages + shared bins stage
RuntimeAptPackages []string // app stage installs
RuntimeApkPackages []string // app stage installs (alpine)
BinCopies []string // explicit per-binary COPY lines for app stage
PathExtensions []string
}
```

The bins stage's apt install is now strictly `curl ca-certificates` (build-time helpers only), scoped to the stage and never propagating to the runtime image.

End-to-end repro

aibuilderdemo (declares curl/git/jq via apt, gh via custom-RUN):

Pre-fix Post-fix
Bins-stage apt install `curl git jq` (runtime bins routed through wrong stage) `curl ca-certificates` (build-time only)
App-stage `COPY --from=bins` wholesale `/usr/local/bin/` (loses /usr/bin contents) `COPY --from=bins /usr/local/bin/gh /usr/local/bin/gh` (per-binary)
App-stage apt install `ca-certificates` only `ca-certificates curl git jq` (runtime deps in app stage)
Binaries in runtime image gh, forge, ca-certificates (curl/git/jq missing) gh, forge, curl, git, jq, ca-certificates ✓

Generated app-stage excerpt post-fix:

```
WORKDIR /app
COPY --from=bins /usr/local/bin/gh /usr/local/bin/gh
COPY . .
RUN apt-get update && apt-get install -y --no-install-recommends ca-certificates curl git jq && rm -rf /var/lib/apt/lists/*
```

Files changed

  • `forge-core/packaging/dockerfile_generator.go` — restructured to return `DockerfileFragments`; apt/apk routed to runtime, per-binary COPYs collected for app stage, image-copy bypasses bins
  • `forge-core/packaging/dockerfile_generator_test.go` — rewrote tests against the new contract per install method, plus a regression test pinning the bundled code-review skill shape (`TestGenerateDockerfile_MixedAptAndDirectURL`)
  • `forge-core/compiler/template_data.go` — added `BinCopies`, `RuntimeAptPackages`, `RuntimeApkPackages`, `PathExtensions` fields
  • `forge-cli/build/dockerfile_stage.go` — populates the new fields from the fragments
  • `forge-cli/templates/Dockerfile.tmpl` — replaced wholesale-directory `COPY` with a per-binary loop; runtime apt/apk packages installed in the app stage; PATH extensions emitted at end

Test plan

  • `go test ./...` from `forge-skills/`, `forge-core/`, `forge-cli/` — all pass
  • `golangci-lint run` — 0 issues
  • End-to-end build on `aibuilderdemo` — Dockerfile inspected by hand; matches the post-fix shape above
  • Regression test `TestGenerateDockerfile_MixedAptAndDirectURL` covers the exact bundled-skill shape that triggered the bug
  • Per-install-method tests confirm: apt-only manifests produce zero bins stage, direct-URL manifests produce a bins stage + per-binary COPYs, image-copy manifests skip bins entirely

Out of scope

  • Multi-arch image-copy from upstream single-binary images (e.g. `alpine/git`). Registry-level decision per binary, tracked separately.
  • The `forge` binary install is independent of the bins/app split — still installed inline in the app stage via `curl | tar` and not touched.

Stacked on

This PR targets the `fix/issue-147-build-output` branch (PR #148) rather than `main` so the `Dockerfile.tmpl` changes from #148 don't conflict. Will rebase to `main` after #148 merges.

… fix checksums path (closes #147)

Five interlocking bugs in the `forge build` output that compound to a
bloated runtime image and a silently-disabled integrity check:

1. **Prompt-builder body duplication.** A SKILL.md with N `## Tool:`
   sections parses into N SkillEntry values that all share the same
   Body. The compiler's prompt builder appended the body once per
   entry, so the bundled code-review-github skill (4 tools) shipped
   4× repeats and aibuilderdemo's compiled/prompt.txt was 1199 lines
   for what should be ~250. Dedup via a seen-bodies set; per-skill
   JSON entries still carry the full Body for SDK consumers.

2. **compiled/prompt.txt and compiled/skills/skills.json are dead.**
   The runtime never opens either file — runner.discoverSkillFiles()
   re-globs skills/SKILL.md on every startup. Stopped generating
   them; deleted forge-cli/skills/writer.go; pruned the matching
   COPY lines from the Dockerfile template and the validate stage's
   required-files check. External library consumers
   (forgecore.Compile) still receive the in-memory CompiledSkills
   struct directly.

3. **Dockerfile `COPY . .` leaked operator-side artifacts into
   /app.** Expanded .dockerignore to exclude k8s/, build-manifest.json,
   the entire compiled/ directory, the Dockerfile itself, the
   .dockerignore itself, and .local-bins/. The runtime image now
   contains only forge.yaml, guardrails.json, <channel>-config.yaml,
   agent.json, policy-scaffold.json, skills/, and checksums.json.

4. **checksums.json integrity check silently skipped in container.**
   The runner looked at <WorkDir>/.forge-output/checksums.json,
   which never exists inside a Forge container — the file lands at
   /app/checksums.json directly. Added a fallback that picks the
   flattened layout when the operator-side layout is absent.

5. **Removed redundant COPY lines from Dockerfile.tmpl** that would
   have failed after fix #2 anyway (the targets no longer exist).

End-to-end verified on the user's aibuilderdemo agent:
- Pre-fix: 21 files in .forge-output, compiled/prompt.txt 1199 lines,
  image contained k8s/ + build-manifest.json + security-audit.json +
  recursive Dockerfile + dead compiled/ artifacts.
- Post-fix: 19 files (compiled/prompt.txt + compiled/skills/skills.json
  removed), image contains only the 8 files the runtime opens,
  operator artifacts intact on disk for forge package / kubectl apply.

Tests pin all four invariants:
- TestCompile_BodyDeduplication_* (3 tests)
- TestSkillsStage_WithSkills updated to assert artifacts NOT created
- TestDockerignore_ExcludesOperatorSideArtifacts +
  TestDockerignore_KeepsRuntimeFiles +
  TestDockerfile_NoLongerCopiesDeadCompiledArtifacts
- TestResolveChecksumsDir_* (3 tests) +
  TestVerifyBuildOutput_ContainerLayoutWorksEndToEnd
…time deps (closes #149)

apt-installed binaries declared by skills were silently lost in the
final image. The pre-fix flow: every skill bin (curl, git, jq, gh)
went through a single shared `bins` stage. apt placed curl/git/jq at
/usr/bin/ (Debian's default), but the application stage only did
`COPY --from=bins /usr/local/bin/ /usr/local/bin/` — picking up just
the manually-placed `gh` and dropping the apt-installed bins entirely.
Any skill script that called curl/git/jq would log "command not
found" at runtime; only `gh` worked because its install path
explicitly moved it to /usr/local/bin/gh.

The blunt wholesale-directory COPY also hid the bug — there was no
per-binary contract to assert against. New code, new artifact, easy
to miss in review.

This change separates the bins-stage and app-stage concerns:

- apt/apk binaries → installed in the application stage directly via
  the existing `RUN apt-get install` slot. The package manager's
  dependency resolution pulls /usr/lib/, /etc/, etc. along
  automatically, so per-bin `COPY --from=bins` plumbing (which can't
  preserve lib deps) is avoided.
- direct-URL + custom-RUN binaries → still flow through the shared
  `bins` stage (they need curl + a writable filesystem to download
  and unpack). The bins stage's apt install is now strictly
  scoped to build-time helpers (curl + ca-certificates) that never
  reach the app image.
- image-copy binaries (e.g. playwright) → copied straight from their
  `bin-<name>` companion stage to the app stage, skipping the bins
  intermediate hop.
- local-file binaries → emitted as `COPY .local-bins/<name>` in the
  app stage (no `--from`, no bins-stage routing).

`packaging.GenerateDockerfile` now returns a structured
`DockerfileFragments` with separate slots for `PreAppStages`,
`RuntimeAptPackages`, `RuntimeApkPackages`, `BinCopies`, and
`PathExtensions`. The template consumes each slot in the right place;
the wholesale `COPY --from=bins /usr/local/bin/ /usr/local/bin/` is
gone, replaced by an explicit per-binary loop.

End-to-end verified on the bundled aibuilderdemo agent (curl/git/jq
via apt, gh via custom-RUN). The pre-fix Dockerfile had curl/git/jq
trapped in the bins stage and dropped on COPY. Post-fix:

  FROM debian:bookworm-slim AS bins
  RUN apt-get update && apt-get install -y --no-install-recommends curl ca-certificates && rm -rf /var/lib/apt/lists/*
  RUN curl -fsSL https://github.com/cli/cli/releases/.../gh_2.60.0_linux_amd64.tar.gz | tar xz -C /tmp
  RUN mv /tmp/.../gh /usr/local/bin/gh
  RUN chmod 0755 /usr/local/bin/gh

  # --- Application stage ---
  FROM debian:bookworm-slim
  ...
  WORKDIR /app
  COPY --from=bins /usr/local/bin/gh /usr/local/bin/gh
  COPY . .
  RUN apt-get update && apt-get install -y --no-install-recommends ca-certificates curl git jq && rm -rf /var/lib/apt/lists/*
  ...
  ENTRYPOINT ["forge","run","--host","0.0.0.0","--with","slack"]

Tests pin the new contract per install method (apt/apk in app stage,
direct-URL/custom-RUN via per-binary COPY from bins, image-copy via
per-binary COPY from companion stage, local-file from build context),
plus a regression test for the bundled code-review skill shape
(curl+git+jq+gh) confirming all four reach the final image.
…all methods, how to add a bin

Adds a dedicated reference for the bin pipeline so contributors and
operators don't have to read forge-core/packaging/ to know how to
declare a new tool in a SKILL.md or override one at the project level.

Covers:
- The four-source priority walker (local-file → skill-local →
  forge.yaml override → registry → fallback), with the actual
  precedence implemented in bin_classifier.go.
- The six install methods (apt, apk, direct-URL, custom-run,
  image-copy, local-file) and which Dockerfile slot each routes
  through after the issue #149 split.
- The four ways to add a binary: use a registry entry, declare inline
  in SKILL.md, override in forge.yaml, or supply a local file via
  --local-bin. Each path with a copy-pasteable example.
- A registry-entry field reference for contributors adding to
  image-registry.yaml.
- A quick decision tree and a sample of the post-#149 generated
  Dockerfile so readers see what `forge build` actually emits.

Cross-links added in:
- docs/core-concepts/skill-md-format.md — the bins: bullet now points
  here for resolution rules.
- docs/reference/forge-yaml-schema.md — package.bin_overrides points
  here for source priority.
- docs/reference/cli-reference.md — --local-bin rows in both `forge
  build` and `forge package` point at the local-file section.
- docs/deployment/docker.md — bin pipeline reference under "Building
  Agent Container Images".
- README.md — Core Concepts table gains a row.

sync-docs mapping table extended so future changes to
forge-skills/registry/image-registry.yaml,
forge-core/packaging/, forge-cli/templates/Dockerfile.tmpl, or
forge-cli/build/dockerfile_stage.go refresh this doc.
@initializ-mk initializ-mk changed the base branch from fix/issue-147-build-output to main June 11, 2026 15:23
@initializ-mk initializ-mk reopened this Jun 11, 2026
@initializ-mk initializ-mk merged commit 7d28f45 into main Jun 11, 2026
10 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

fix(build): apt-installed binaries silently lost in runtime image — wrong path between bins stage and app stage

1 participant