feat(deploy): AI Studio production deployment — compose + Swarm paths #32

Draft
librowski wants to merge 11 commits into main from WB-229-swarm-alignment

@librowski librowski commented Jun 11, 2026

Collaborator

Implements WB-229 — AI Studio public demo, phase A (lean MVP).

What this adds

  • deploy/ai-studio/ — portable single-host deployment: one multi-target Dockerfile (runtime = backend+worker via tsx, web = nginx serving the SPA + /api proxy with SSE tuning and per-request DNS re-resolution), a production compose file (only nginx publishes a port; Postgres ×2, Temporal, backend stay internal; Temporal UI behind a debug profile; pinned images, no :latest), .env.example, and a DevOps runbook.
  • tools/deployment/ — Swarm overlay mirroring the workflow-builder repo's deploy machinery (ACR commit-tagged images, Traefik labels, Ansible playbook; gatekeeper optional via AUTH_ENABLED). Same images, different orchestration. Status: proposed, pending DevOps sign-off — not yet exercised against the real cluster.
  • Migrations on backend boot — drizzle-orm's programmatic migrator runs the SQL files before the server accepts traffic; no migrate image/service/playbook step. Worker is gated on the backend healthcheck (healthy = schema applied). Single-replica assumption, stated in code.
  • Per-IP rate limit on the execute route (WB-229's abuse gate): off in local dev, 10/min + 50/day in the deploy; trusts X-Forwarded-For only behind our nginx. The money cap is the OpenRouter account Guardrail (dashboard-side, see runbook).
  • Demo model wired to mistralai/mistral-small-3.2-24b-instruct via env (pricing re-verified 2026-06-10: $0.075/$0.20 per Mtok ≈ $0.0004/run).
  • .dockerignore hardened: **/.env* can no longer enter a build context.
  • Bug fix: the worker ignored TEMPORAL_ADDRESS. Worker.create without an explicit connection dials 127.0.0.1:7233; invisible in dev, fatal in containers.
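The X-Forwarded-For trust rule from the rate-limit bullet can be sketched roughly as below. This is a minimal illustration, not the PR's code: `clientIp` and its parameters are hypothetical, and it assumes exactly one trusted proxy (our nginx) that appends the peer address to the header, so only the last entry is taken at face value.

```typescript
// Hypothetical helper, not taken from the PR: pick the address to rate-limit on.
function clientIp(
  socketAddress: string,
  forwardedFor: string | undefined,
  trustProxy: boolean,
): string {
  // Without a trusted proxy the header is client-controlled, so ignore it.
  if (!trustProxy || !forwardedFor) return socketAddress;
  // Assumption: one trusted proxy appends the peer address, so the last
  // entry is the one it wrote; earlier entries may have been spoofed.
  const entries = forwardedFor
    .split(",")
    .map((entry) => entry.trim())
    .filter((entry) => entry.length > 0);
  return entries.pop() ?? socketAddress;
}
```

With `trustProxy` off, a client that sends its own X-Forwarded-For still gets keyed on the socket address, which is why the flag should only be enabled behind a proxy that sets the header.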

Related: #33 — decision-node catch-all fix, found by this PR's verification but split out for its own review.

Verification

  • Wiped volumes → virgin boot: backend logs database migrations applied before listening; worker converges without crash-looping.
  • Sales Inquiry Pipeline ran to execution_completed over live Mistral calls with SSE streaming through nginx; rate limiter returns 429 past the minute budget.
  • 63 backend tests green; lint + typecheck clean (the apps/docs astro-check failure pre-exists on main).

Not in this PR (ops-side, tracked in WB-229)

Azure VM / cluster provisioning, DNS + TLS, OpenRouter Guardrail configuration.

…t localhost

Worker.create without an explicit connection dials 127.0.0.1:7233,
ignoring the TEMPORAL_ADDRESS env var — correct in local dev, broken in
any deployment where Temporal is not on loopback.
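A minimal sketch of the fix's idea, assuming the worker reads the address from the environment and passes it to an explicit connection. `temporalAddress` is a hypothetical helper, and the SDK wiring is shown only as comments so the snippet runs without the Temporal dependency.

```typescript
// Hypothetical helper: resolve the Temporal frontend address from the environment.
function temporalAddress(env: Record<string, string | undefined>): string {
  const configured = env["TEMPORAL_ADDRESS"]?.trim();
  // Fall back to the loopback default only when nothing is configured.
  return configured ? configured : "127.0.0.1:7233";
}

// Assumed wiring with the Temporal TypeScript SDK (@temporalio/worker):
//   const connection = await NativeConnection.connect({
//     address: temporalAddress(process.env),
//   });
//   const worker = await Worker.create({ connection, taskQueue, workflowsPath });
```

Passing the connection explicitly keeps local dev unchanged (the default still applies) while letting a container point at a non-loopback Temporal host.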

…h-all

The no_branch_matched error message and the Sales Inquiry reference
template both treat a branch with no conditions as the explicit
catch-all, but branchMatches returned false for it — any input
classified outside the keyword branches failed the whole run.

Supersedes the empty-conditions bullet of
packages/execution-core/decision-no-match.decision-log.md (the strict
fail-fast core of that decision is unchanged); see
apps/execution-worker/decision-catch-all.decision-log.md.

Fixed-window, in-memory limiter (WB-229 abuse gate). Disabled unless
RATE_LIMIT_EXECUTE_PER_MINUTE / RATE_LIMIT_EXECUTE_PER_DAY are set, so
local dev is unaffected. TRUST_PROXY=true reads the client IP from
X-Forwarded-For — only enable behind a proxy that sets it.
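A fixed-window, in-memory limiter with a per-minute and a per-day budget might look like the sketch below. It is an illustration under assumptions, not the PR's implementation: the class and function names are hypothetical, each budget is treated as independently optional, and stale counters are never evicted.

```typescript
// Hypothetical sketch of a fixed-window, in-memory limiter; not the PR's code.
type Budget = { limit: number; windowMs: number };
type Counter = { windowStart: number; count: number };

class FixedWindowLimiter {
  private readonly counters = new Map<string, Counter>();
  private readonly budgets: Budget[];

  // An empty budget list means the limiter is disabled (the local-dev case).
  constructor(budgets: Budget[]) {
    this.budgets = budgets;
  }

  // Returns true if the request is allowed and records it against every budget.
  allow(ip: string, now: number = Date.now()): boolean {
    const touched: Counter[] = [];
    for (const budget of this.budgets) {
      const key = `${ip}|${budget.windowMs}`;
      let counter = this.counters.get(key);
      // Start a fresh window when none exists or the current one has elapsed.
      if (!counter || now - counter.windowStart >= budget.windowMs) {
        counter = { windowStart: now, count: 0 };
        this.counters.set(key, counter);
      }
      if (counter.count >= budget.limit) return false; // caller answers 429
      touched.push(counter);
    }
    // Count the request only once every budget has room for it.
    for (const counter of touched) counter.count += 1;
    return true;
  }
}

// Build budgets from the env vars named in the commit; unset or invalid means off.
function budgetsFromEnv(env: Record<string, string | undefined>): Budget[] {
  const budgets: Budget[] = [];
  const perMinute = Number(env["RATE_LIMIT_EXECUTE_PER_MINUTE"]);
  const perDay = Number(env["RATE_LIMIT_EXECUTE_PER_DAY"]);
  if (perMinute > 0) budgets.push({ limit: perMinute, windowMs: 60_000 });
  if (perDay > 0) budgets.push({ limit: perDay, windowMs: 86_400_000 });
  return budgets;
}
```

Because the counters live in process memory, this design only holds for a single backend replica, and a restart resets every budget.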

deploy/ai-studio/: multi-target Dockerfile (runtime/migrate/web),
production docker-compose (only nginx public, pinned images, automatic
migrations), nginx SPA+API proxy with SSE tuning and per-request DNS
re-resolution, .env.example with Mistral Small 3.2 default, DevOps
README, and a decision log covering the architecture choices.

tsx becomes a real dependency of backend and worker (start:prod runs
without an .env file); .dockerignore now keeps **/.env out of build
contexts.

Verified end-to-end: Sales Inquiry Pipeline to execution_completed with
live SSE through nginx; rate limiter returns 429 past the budget.

tools/deployment/ mirrors the workflow-builder repo's deployment path
(build-docker.sh, deploy.sh, ansible deploy-application playbook) and
consumes the same three images from deploy/ai-studio/Dockerfile — only
the orchestration layer differs. Deviations forced by AI Studio being
stateful: node-pinned volumes for Postgres/Temporal, post-deploy
migration step (Swarm ignores depends_on), attachable internal network
with short DNS aliases, and an AUTH_ENABLED-gated gatekeeper so the
public demo stays login-free.

Stack template render-verified in both auth modes; status 'Proposed'
pending the DevOps conversation.

drizzle-orm's programmatic migrator runs the SQL files from
apps/backend/drizzle/ before the server accepts traffic. A failure
(database still starting) exits the process; container restart policies
retry until it converges. drizzle-kit stays a devDependency — db:migrate
remains available for out-of-band use.
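The migrate-then-listen ordering described above can be sketched with injected dependencies. This is a hypothetical outline rather than the PR's entrypoint: `boot` and `BootDeps` are illustrative names, and the migrator is passed in so the snippet runs without a database.

```typescript
// Hypothetical boot sequence: apply migrations, then start listening.
type BootDeps = {
  migrate: () => Promise<void>; // e.g. a wrapper around drizzle-orm's migrate()
  listen: () => Promise<void>; // starts the HTTP server
  log: (message: string) => void;
};

// Resolves to a process exit code; 0 means the server is up.
async function boot(deps: BootDeps): Promise<number> {
  try {
    await deps.migrate();
    deps.log("database migrations applied");
  } catch (error) {
    // The database may still be starting: exit and let the restart policy retry.
    deps.log(`migration failed: ${String(error)}`);
    return 1;
  }
  await deps.listen();
  return 0;
}

// Assumed real wiring (the driver module is a guess, the PR does not name it):
//   import { migrate } from "drizzle-orm/node-postgres/migrator";
//   migrate: () => migrate(db, { migrationsFolder: "<path to the SQL files>" })
```

Returning an exit code instead of retrying in-process keeps the retry loop in the container restart policy, which matches the commit's description.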

The backend migrates itself at boot, so the migrate Dockerfile target,
compose service, and the Swarm playbook's post-deploy migration task
(plus its attachable-network requirement) all go away. Two images
remain: runtime and web. The worker now waits for the backend
healthcheck so it never touches a pre-migration schema.

Verified on a wiped stack: virgin database boots, backend logs
'database migrations applied' before listening, Sales Inquiry Pipeline
runs to execution_completed over live SSE, rate limiter returns 429
past the budget.

Keep only what the code cannot say itself: traps (Worker.create's
implicit localhost, corepack/pnpm 10 failure, offline mode leaking into
lifecycle scripts), constraints (single-replica migrator, X-Forwarded-For
trust), and magic values. Drop the narration.

Reverts the empty-conditions-as-catch-all change (ccf7375) and its
decision log. It changes execution semantics and supersedes a clause of
decision-no-match.decision-log.md — that deserves a focused review, not
a ride-along in a deployment PR.

AI Studio is a POC: the README and the comments on the non-obvious
pieces carry what operators need; full architecture rationale is
premature at this stage.

… swarm stack

Backend healthcheck (fetch /api/health) lets Swarm detect when
migrations are done — without it the worker can hit a pre-migration
schema. Explicit restart_policy on every service replaces the implicit
Swarm default; crash-looping services (worker, temporal) get
max_attempts. Internal network gets driver: overlay for clarity.
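A healthcheck probe of the kind mentioned here could be as small as the sketch below. It is an assumption-laden illustration: `probe` and `FetchLike` are hypothetical names, and the port in the usage comment is a placeholder that does not come from the PR.

```typescript
// Hypothetical healthcheck probe: exit code 0 when /api/health answers 2xx.
type FetchLike = (url: string) => Promise<{ ok: boolean }>;

async function probe(fetchFn: FetchLike, url: string): Promise<number> {
  try {
    const response = await fetchFn(url);
    return response.ok ? 0 : 1;
  } catch {
    // Connection refused, DNS failure, etc.: the backend is not ready yet.
    return 1;
  }
}

// Assumed entrypoint usage (port 3000 is a placeholder):
//   probe(fetch, "http://127.0.0.1:3000/api/health")
//     .then((code) => process.exit(code));
```

Since the backend only starts listening after migrations succeed, a passing probe doubles as a "schema applied" signal for anything gated on it.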