feat(deploy): AI Studio production deployment (compose + Swarm paths) #32
Draft
librowski wants to merge 11 commits into
…t localhost
Worker.create without an explicit connection dials 127.0.0.1:7233 and ignores the TEMPORAL_ADDRESS env var. That is correct in local dev but broken in any deployment where Temporal is not on loopback.
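A minimal sketch of the fix described above, assuming the worker resolves the address once at startup. The helper name resolveTemporalAddress and the Env type are hypothetical, and the SDK wiring in the trailing comment is an assumption about how the worker is started, not the PR's actual code.

```typescript
// Loopback address that, per the commit message, is dialed when no connection is passed.
const LOCAL_TEMPORAL_ADDRESS = "127.0.0.1:7233";

type Env = Record<string, string | undefined>;

// Hypothetical helper: prefer TEMPORAL_ADDRESS, fall back to loopback only
// when the variable is unset or blank (the local-dev case).
function resolveTemporalAddress(env: Env): string {
  const configured = env.TEMPORAL_ADDRESS?.trim();
  return configured ? configured : LOCAL_TEMPORAL_ADDRESS;
}

// Assumed wiring with the Temporal TypeScript SDK (not executed here):
//   const connection = await NativeConnection.connect({
//     address: resolveTemporalAddress(process.env),
//   });
//   const worker = await Worker.create({ connection, taskQueue, workflowsPath });
```

Passing the connection explicitly keeps local dev unchanged while letting a container point at a Temporal service name on the internal network.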
…h-all
The no_branch_matched error message and the Sales Inquiry reference template both treat a branch with no conditions as the explicit catch-all, but branchMatches returned false for it, so any input classified outside the keyword branches failed the whole run. Supersedes the empty-conditions bullet of packages/execution-core/decision-no-match.decision-log.md (the strict fail-fast core of that decision is unchanged); see apps/execution-worker/decision-catch-all.decision-log.md.
Fixed-window, in-memory limiter (WB-229 abuse gate). Disabled unless RATE_LIMIT_EXECUTE_PER_MINUTE / RATE_LIMIT_EXECUTE_PER_DAY are set, so local dev is unaffected. TRUST_PROXY=true reads the client IP from X-Forwarded-For — only enable behind a proxy that sets it.
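The limiter described above could look roughly like the sketch below. The class and function names (FixedWindowLimiter, limiterFromEnv, clientIp) are hypothetical, and taking the first X-Forwarded-For entry assumes the proxy overwrites the header with the real client address; none of this is taken from the PR's code.

```typescript
type Env = Record<string, string | undefined>;

interface WindowState {
  startedAt: number;
  count: number;
}

// Fixed-window counter: each key gets `limit` hits per `windowMs`; the count
// resets when a new window starts. State lives in process memory only.
class FixedWindowLimiter {
  private readonly windows = new Map<string, WindowState>();
  private readonly limit: number;
  private readonly windowMs: number;

  constructor(limit: number, windowMs: number) {
    this.limit = limit;
    this.windowMs = windowMs;
  }

  // Returns true when the hit is allowed, false when the budget is spent
  // (the caller would answer 429).
  allow(key: string, now: number = Date.now()): boolean {
    const current = this.windows.get(key);
    if (!current || now - current.startedAt >= this.windowMs) {
      this.windows.set(key, { startedAt: now, count: 1 });
      return true;
    }
    if (current.count >= this.limit) return false;
    current.count += 1;
    return true;
  }
}

// Hypothetical factory: null means the gate is disabled (variable unset or invalid).
function limiterFromEnv(env: Env, name: string, windowMs: number): FixedWindowLimiter | null {
  const raw = env[name];
  const limit = raw === undefined ? NaN : Number(raw);
  return Number.isInteger(limit) && limit > 0 ? new FixedWindowLimiter(limit, windowMs) : null;
}

// Only believe X-Forwarded-For when TRUST_PROXY=true; otherwise use the socket address.
function clientIp(env: Env, forwardedFor: string | undefined, socketAddress: string): string {
  if (env.TRUST_PROXY === "true" && forwardedFor) {
    const first = forwardedFor.split(",")[0]?.trim();
    if (first) return first;
  }
  return socketAddress;
}
```

A fixed window is simple and cheap but allows a burst at a window boundary, and in-memory state is per process, so this shape only holds for a single backend replica.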
deploy/ai-studio/: multi-target Dockerfile (runtime/migrate/web), production docker-compose (only nginx public, pinned images, automatic migrations), nginx SPA+API proxy with SSE tuning and per-request DNS re-resolution, .env.example with Mistral Small 3.2 default, DevOps README, and a decision log covering the architecture choices. tsx becomes a real dependency of backend and worker (start:prod runs without an .env file); .dockerignore now keeps **/.env out of build contexts. Verified end-to-end: Sales Inquiry Pipeline to execution_completed with live SSE through nginx; rate limiter returns 429 past the budget.
tools/deployment/ mirrors the workflow-builder repo's deployment path (build-docker.sh, deploy.sh, ansible deploy-application playbook) and consumes the same three images from deploy/ai-studio/Dockerfile — only the orchestration layer differs. Deviations forced by AI Studio being stateful: node-pinned volumes for Postgres/Temporal, post-deploy migration step (Swarm ignores depends_on), attachable internal network with short DNS aliases, and an AUTH_ENABLED-gated gatekeeper so the public demo stays login-free. Stack template render-verified in both auth modes; status 'Proposed' pending the DevOps conversation.
drizzle-orm's programmatic migrator runs the SQL files from apps/backend/drizzle/ before the server accepts traffic. A failure (database still starting) exits the process; container restart policies retry until it converges. drizzle-kit stays a devDependency — db:migrate remains available for out-of-band use.
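The boot order described above can be sketched with the migrator and the server injected as plain functions. bootBackend and its parameter types are hypothetical names; the drizzle driver shown in the trailing comment is a guess, since the PR text does not say which one the backend uses.

```typescript
type Migrate = () => Promise<void>;
type Listen = () => Promise<void>;
type Exit = (code: number) => void;
type Log = (message: string) => void;

// Hypothetical boot sequence: apply migrations first and listen only on success.
// On failure, exit non-zero so the container restart policy retries.
async function bootBackend(
  migrate: Migrate,
  listen: Listen,
  exit: Exit,
  log: Log = console.log,
): Promise<void> {
  try {
    await migrate();
    log("database migrations applied");
  } catch (error) {
    log(`migration failed: ${String(error)}`);
    exit(1);
    return;
  }
  await listen();
}

// Assumed production wiring (driver choice is a guess, not executed here):
//   import { migrate } from "drizzle-orm/node-postgres/migrator";
//   await bootBackend(
//     () => migrate(db, { migrationsFolder: "drizzle" }),
//     async () => { server.listen(port); },
//     process.exit,
//   );
```

Because the server never listens before the migrator returns, a passing health endpoint doubles as a signal that the schema is current.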
The backend migrates itself at boot, so the migrate Dockerfile target, compose service, and the Swarm playbook's post-deploy migration task (plus its attachable-network requirement) all go away. Two images remain: runtime and web. The worker now waits for the backend healthcheck so it never touches a pre-migration schema. Verified on a wiped stack: virgin database boots, backend logs 'database migrations applied' before listening, Sales Inquiry Pipeline runs to execution_completed over live SSE, rate limiter returns 429 past the budget.
Keep only what the code cannot say itself: traps (Worker.create's implicit localhost, corepack/pnpm 10 failure, offline mode leaking into lifecycle scripts), constraints (single-replica migrator, X-Forwarded-For trust), and magic values. Drop the narration.
Reverts the empty-conditions-as-catch-all change (ccf7375) and its decision log. It changes execution semantics and supersedes a clause of decision-no-match.decision-log.md — that deserves a focused review, not a ride-along in a deployment PR.
AI Studio is a POC — the README and the comments on the non-obvious pieces carry what operators need; full architecture rationale is premature at this stage.
… swarm stack
Backend healthcheck (fetch /api/health) lets Swarm detect when migrations are done; without it the worker can hit a pre-migration schema. Explicit restart_policy on every service replaces the implicit Swarm default; crash-looping services (worker, temporal) get max_attempts. Internal network gets driver: overlay for clarity.
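A fetch-based healthcheck like the one mentioned above might reduce to a probe that maps the response to an exit code. healthExitCode and FetchLike are hypothetical names, and the port in the usage comment is a placeholder; the PR's actual healthcheck command is not shown in this text.

```typescript
// Minimal shape of what the probe needs from a fetch implementation.
type FetchLike = (url: string) => Promise<{ ok: boolean }>;

// Hypothetical probe: 0 when the endpoint answers with a 2xx status, 1 otherwise.
async function healthExitCode(url: string, fetchImpl: FetchLike): Promise<number> {
  try {
    const response = await fetchImpl(url);
    return response.ok ? 0 : 1;
  } catch {
    // Connection refused or similar: the backend is not up yet (or still migrating).
    return 1;
  }
}

// Assumed container usage with Node's global fetch (port is a placeholder):
//   healthExitCode("http://localhost:3000/api/health", fetch)
//     .then((code) => process.exit(code));
```

Injecting the fetch function keeps the probe testable without a running backend.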
Implements WB-229 — AI Studio public demo, phase A (lean MVP).
What this adds
- deploy/ai-studio/: portable single-host deployment. One multi-target Dockerfile (runtime = backend + worker via tsx; web = nginx serving the SPA plus an /api proxy with SSE tuning and per-request DNS re-resolution), a production compose file (only nginx publishes a port; Postgres ×2, Temporal, and backend stay internal; Temporal UI behind a debug profile; pinned images, no :latest), .env.example, and a DevOps runbook.
- tools/deployment/: Swarm overlay mirroring the workflow-builder repo's deploy machinery (ACR commit-tagged images, Traefik labels, Ansible playbook; gatekeeper optional via AUTH_ENABLED). Same images, different orchestration. Status: proposed, pending DevOps sign-off; not yet exercised against the real cluster.
- … X-Forwarded-For only behind our nginx. The money cap is the OpenRouter account Guardrail (dashboard-side, see runbook).
- … mistralai/mistral-small-3.2-24b-instruct via env (pricing re-verified 2026-06-10: $0.075/$0.20 per Mtok ≈ $0.0004/run).
- .dockerignore hardened: **/.env* can no longer enter a build context.
- … TEMPORAL_ADDRESS: Worker.create without an explicit connection dials 127.0.0.1:7233; invisible in dev, fatal in containers.

Related: #33, the decision-node catch-all fix, found by this PR's verification but split out for its own review.
Verification
- … database migrations applied before listening; worker converges without crash-looping.
- … execution_completed over live Mistral calls with SSE streaming through nginx; rate limiter returns 429 past the minute budget.
- … apps/docs astro-check failure pre-exists on main).

Not in this PR (ops-side, tracked in WB-229)
Azure VM / cluster provisioning, DNS + TLS, OpenRouter Guardrail configuration.