docs: restructure self-hosting sidebar; add Production subsection and Support #705
base: docs/observe-concepts
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -1,137 +1,28 @@ | ||
| --- | ||
| title: "Production Hardening & Operations" | ||
| description: "Production readiness checklist — replace secrets, configure TLS, set up managed data stores, run Postgres/ClickHouse/MinIO backups, and follow the upgrade runbook." | ||
| title: "Production" | ||
| description: "What to harden before a self-hosted Future AGI instance goes live" | ||
| --- | ||
|
|
||
| ## About | ||
| Everything past a local trial happens here. The default Docker Compose stack boots with placeholder secrets, no TLS, and compose-managed data stores. That's fine on a laptop. Before real traffic reaches the instance, work through the flow below in order, then keep each page as a runbook. | ||
|
|
||
| Run through this before exposing the stack to real users. Covers secrets, TLS, swapping in managed data stores, backup commands for Postgres/ClickHouse/MinIO, Prometheus monitoring, and the upgrade and rollback runbook. | ||
| ## In this page | ||
|
|
||
| ## Hardening checklist | ||
|
|
||
| **Secrets** — replace all `CHANGEME` values before going live: | ||
|
|
||
| ```bash | ||
| openssl rand -hex 32 # SECRET_KEY, AGENTCC_INTERNAL_API_KEY | ||
| openssl rand -base64 24 # PG_PASSWORD, MINIO_ROOT_PASSWORD | ||
| ``` | ||
|
|
||
| **Runtime flags** in `.env`: | ||
| - `ENV_TYPE=prod` | ||
| - `FAST_STARTUP=false` | ||
| - `GRANIAN_WORKERS=<your CPU count>` | ||
|
|
||
| **TLS** — the frontend and backend don't terminate TLS. Put Caddy, nginx, or Traefik in front: | ||
|
|
||
| ``` | ||
| # Caddyfile (simplest — auto-issues Let's Encrypt certs) | ||
| app.yourcompany.com { reverse_proxy localhost:3000 } | ||
| api.yourcompany.com { reverse_proxy localhost:8000 } | ||
| ``` | ||
|
|
||
| After setting up TLS, set `VITE_HOST_API=https://api.yourcompany.com` in `.env` and rebuild: | ||
|
|
||
| ```bash | ||
| docker compose build frontend && docker compose up -d frontend | ||
| ``` | ||
|
|
||
| **Managed data stores** — for production, replace compose-managed services: | ||
|
|
||
| | Replace | With | Change | | ||
| |---|---|---| | ||
| | `postgres` | RDS / Aurora / Cloud SQL | Set `PG_*` vars to managed endpoint | | ||
| | `clickhouse` | ClickHouse Cloud | Set `CH_HOST`, `CH_PORT`, etc. | | ||
| | `redis` | ElastiCache / Upstash | Set `REDIS_URL` | | ||
| | `minio` | AWS S3 | Set `S3_ENDPOINT_URL=https://s3.amazonaws.com` + AWS creds | | ||
|
|
||
| <Note> | ||
| `code-executor` requires `privileged: true`. Run on EC2 / GCE instances — not Fargate or Cloud Run. | ||
| </Note> | ||
|
|
||
| **Secrets manager** — use AWS Secrets Manager, HashiCorp Vault, or GCP Secret Manager instead of a plain `.env` file. | ||
|
|
||
| --- | ||
|
|
||
| ## Backups | ||
|
|
||
| ### PostgreSQL | ||
|
|
||
| ```bash | ||
| # Backup | ||
| docker compose exec postgres \ | ||
| pg_dump -U futureagi -d futureagi --format=custom \ | ||
| > backup-$(date +%F).dump | ||
|
|
||
| # Restore | ||
| docker compose exec -T postgres \ | ||
| pg_restore -U futureagi -d futureagi --clean --if-exists \ | ||
| < backup-2026-04-22.dump | ||
| ``` | ||
|
|
||
| Volumes: `future-agi_postgres-data` · `future-agi_clickhouse-data` · `future-agi_redis-data` · `future-agi_minio-data` · `future-agi_peerdb-catalog-data` · `future-agi_peerdb-minio-data` | ||
|
|
||
| ### ClickHouse | ||
|
|
||
| ```sql | ||
| BACKUP TABLE default.traces TO S3('s3://your-bucket/ch-backup/', 'KEY', 'SECRET'); | ||
| ``` | ||
|
|
||
| ClickHouse data can also be rebuilt from scratch by re-running PeerDB init since it replicates from Postgres. | ||
|
|
||
| ### MinIO | ||
|
|
||
| ```bash | ||
| mc alias set local http://localhost:9005 futureagi <MINIO_ROOT_PASSWORD> | ||
| mc alias set s3 https://s3.amazonaws.com <AWS_KEY> <AWS_SECRET> | ||
| mc mirror local/ s3/your-bucket/ | ||
| ``` | ||
|
|
||
| --- | ||
|
|
||
| ## Monitoring | ||
|
|
||
| Backend exposes Prometheus metrics at `http://localhost:8000/metrics`. Add a scraper: | ||
|
|
||
| ```yaml | ||
| # prometheus.yml | ||
| scrape_configs: | ||
| - job_name: futureagi | ||
| static_configs: | ||
| - targets: ['localhost:8000'] | ||
| metrics_path: /metrics | ||
| ``` | ||
|
|
||
| Key signals: backend error rate, Temporal workflow success/failure, Postgres WAL lag (PeerDB health), ClickHouse query latency, PeerDB mirror status at [localhost:3001](http://localhost:3001). | ||
|
|
||
| --- | ||
|
|
||
| ## Upgrades | ||
|
|
||
| ```bash | ||
| git pull | ||
| docker compose build | ||
| docker compose up -d | ||
| ``` | ||
|
|
||
| Migrations run automatically. If a migration fails: `docker compose exec backend python manage.py migrate` | ||
|
|
||
| If release notes mention PeerDB changes: `docker compose run --rm peerdb-init bash /setup.sh` | ||
|
|
||
| **Rollback:** | ||
|
|
||
| ```bash | ||
| git log --oneline -5 | ||
| git checkout <previous-hash> | ||
| docker compose build && docker compose up -d | ||
| ``` | ||
|
|
||
| ## Next Steps | ||
| Production readiness for a self-hosted instance breaks into five steps. Do them in order the first time. | ||
|
|
||
| <CardGroup cols={2}> | ||
| <Card title="Troubleshooting" icon="wrench" href="/docs/self-hosting/troubleshooting"> | ||
| Symptoms, causes, and fixes for common errors. | ||
| <Card title="Checklist" icon="list-check" href="/docs/self-hosting/production/checklist"> | ||
| The go-live pass: secrets, prod runtime flags, and managed data stores | ||
| </Card> | ||
| <Card title="Security & TLS" icon="shield" href="/docs/self-hosting/production/security-tls"> | ||
| Terminate TLS in front of the stack and lock down secrets | ||
| </Card> | ||
| <Card title="Backups & restore" icon="database" href="/docs/self-hosting/production/backups-restore"> | ||
| Back up and restore Postgres, ClickHouse, and MinIO | ||
| </Card> | ||
| <Card title="Monitoring" icon="gauge" href="/docs/self-hosting/production/monitoring"> | ||
| Scrape Prometheus metrics and watch the signals that matter | ||
| </Card> | ||
| <Card title="System Configuration" icon="sliders" href="/docs/self-hosting/configuration"> | ||
| Tune the LLM gateway, PeerDB mirrors, and Temporal workers. | ||
| <Card title="Upgrades & rollback" icon="arrows-rotate" href="/docs/self-hosting/production/upgrades-rollback"> | ||
| Pull a release, run migrations, and roll back safely | ||
| </Card> | ||
| </CardGroup> |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,68 @@ | ||
| --- | ||
| title: "Backups & restore" | ||
| description: "Back up and restore the data stores behind a self-hosted instance" | ||
| --- | ||
|
|
||
| A self-hosted instance keeps state in four stores: Postgres for application data, ClickHouse for observability records, MinIO for object storage, and Redis for cache and queues. This page covers backing up and restoring the three that hold durable data. Redis is a cache and doesn't need a backup. | ||
|
Contributor
'Cache and queues' at the start, 'a cache' at the end. If it holds queues, losing it drops queued work. Either say it's safe to lose because it rebuilds, or drop 'queues'
||
|
|
||
| ## Postgres | ||
|
|
||
| Postgres holds the application data, so back it up on a schedule. Use the custom format so restores can run selectively: | ||
|
|
||
| ```bash | ||
| # Backup | ||
| docker compose exec postgres \ | ||
|
Contributor
Add |
||
| pg_dump -U futureagi -d futureagi --format=custom \ | ||
| > backup-$(date +%F).dump | ||
|
|
||
| # Restore | ||
| docker compose exec -T postgres \ | ||
| pg_restore -U futureagi -d futureagi --clean --if-exists \ | ||
| < backup-2026-04-22.dump | ||
| ``` | ||
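To run the backup above on a schedule, one option is a small wrapper like the sketch below. The helper names `backup_postgres` and `prune_backups`, the backup directory, and the retention count are hypothetical and not part of the PR. The user and database name come from the commands above. The sketch passes `-T` to `docker compose exec`, which disables pseudo-TTY allocation so the binary custom-format dump is not altered on its way to the redirect.

```bash
#!/usr/bin/env bash
# Sketch only: backup_postgres and prune_backups are hypothetical helper names.
# The user and database name (futureagi) are taken from the page above.

# Write a custom-format dump to <dir>/backup-YYYY-MM-DD.dump.
backup_postgres() {
  local dir="$1"
  mkdir -p "$dir"
  # -T disables pseudo-TTY allocation so the binary dump reaches the file unmodified.
  docker compose exec -T postgres \
    pg_dump -U futureagi -d futureagi --format=custom \
    > "$dir/backup-$(date +%F).dump"
}

# Keep the newest <keep> dumps in <dir> and delete the rest.
prune_backups() {
  local dir="$1" keep="$2"
  # ISO dates sort lexicographically, so a reverse sort lists the newest first.
  find "$dir" -maxdepth 1 -name 'backup-*.dump' | sort -r | tail -n +"$((keep + 1))" |
    while IFS= read -r old; do
      rm -f -- "$old"
    done
}

# Example with an assumed path and retention:
#   backup_postgres /var/backups/futureagi && prune_backups /var/backups/futureagi 7
```

Called from cron or a systemd timer, this would produce one dump per day and keep the last seven. The retention count is only an illustration.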
|
|
||
| The Docker volumes that hold state: | ||
|
|
||
| | Volume | Holds | | ||
| |---|---| | ||
| | `future-agi_postgres-data` | Postgres application data | | ||
|
Contributor
Wrong prefix. The compose file sets |
||
| | `future-agi_clickhouse-data` | ClickHouse records | | ||
| | `future-agi_redis-data` | Redis cache | | ||
| | `future-agi_minio-data` | MinIO objects | | ||
| | `future-agi_peerdb-catalog-data` | PeerDB replication catalog | | ||
| | `future-agi_peerdb-minio-data` | PeerDB staging objects | | ||
|
|
||
| ## ClickHouse | ||
|
|
||
| ClickHouse can back up straight to S3: | ||
|
|
||
| ```sql | ||
| BACKUP TABLE default.traces TO S3('s3://your-bucket/ch-backup/', 'KEY', 'SECRET'); | ||
| ``` | ||
|
|
||
| <Note> | ||
| ClickHouse is a replica of Postgres data, streamed through PeerDB. If you lose it, rebuild it from scratch by re-running PeerDB init rather than restoring a backup. The steps are in [Upgrades & rollback](/docs/self-hosting/production/upgrades-rollback). | ||
|
Contributor
This isn't true anymore. With the CH25 cutover ( |
||
| </Note> | ||
|
|
||
| ## MinIO | ||
|
|
||
| Mirror the MinIO bucket to S3 with the MinIO client: | ||
|
|
||
| ```bash | ||
| mc alias set local http://localhost:9005 futureagi <MINIO_ROOT_PASSWORD> | ||
| mc alias set s3 https://s3.amazonaws.com <AWS_KEY> <AWS_SECRET> | ||
| mc mirror local/ s3/your-bucket/ | ||
| ``` | ||
|
|
||
| If you've already moved to [managed data stores](/docs/self-hosting/production/checklist), your provider's own backup tooling replaces these commands. | ||
|
|
||
| ## Dive deeper | ||
|
|
||
| <CardGroup cols={2}> | ||
| <Card title="Monitoring" icon="gauge" href="/docs/self-hosting/production/monitoring"> | ||
| Watch store health and replication lag | ||
| </Card> | ||
| <Card title="Upgrades & rollback" icon="arrows-rotate" href="/docs/self-hosting/production/upgrades-rollback"> | ||
| Rebuild ClickHouse and roll back releases | ||
| </Card> | ||
| </CardGroup> | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,57 @@ | ||
| --- | ||
| title: "Checklist" | ||
| description: "The go-live pass before a self-hosted instance takes real traffic" | ||
| --- | ||
|
|
||
| Run through this once before the stack is reachable by anyone else. It covers the three things that separate a laptop trial from a real deployment: replacing the shipped secrets, switching the backend into production mode, and swapping compose-managed data stores for managed ones. | ||
|
Contributor
Make the three things bullet points
||
|
|
||
| ## Replace the shipped secrets | ||
|
|
||
| The stack boots with `CHANGEME` placeholders. Replace every one before the instance is reachable, and generate each value rather than making one up: | ||
|
Contributor
There's no |
||
|
|
||
| ```bash | ||
| openssl rand -hex 32 # SECRET_KEY, AGENTCC_INTERNAL_API_KEY | ||
| openssl rand -base64 24 # PG_PASSWORD, MINIO_ROOT_PASSWORD | ||
| ``` | ||
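The two `openssl` commands above can be wrapped so that all four values are generated in one pass. This is a minimal sketch: the function names are hypothetical, and only the variable names and the `openssl rand` invocations come from the page. The output is printed as `KEY=value` lines and is not written to any file.

```bash
#!/usr/bin/env bash
# Sketch only: gen_hex_secret, gen_b64_secret, and print_secrets are hypothetical names.

# 32 random bytes as 64 hex characters.
gen_hex_secret() { openssl rand -hex 32; }

# 24 random bytes as 32 base64 characters (24 is a multiple of 3, so no padding).
gen_b64_secret() { openssl rand -base64 24; }

# Print one KEY=value line per secret named on the page above.
print_secrets() {
  printf 'SECRET_KEY=%s\n' "$(gen_hex_secret)"
  printf 'AGENTCC_INTERNAL_API_KEY=%s\n' "$(gen_hex_secret)"
  printf 'PG_PASSWORD=%s\n' "$(gen_b64_secret)"
  printf 'MINIO_ROOT_PASSWORD=%s\n' "$(gen_b64_secret)"
}
```

Base64 output can contain `+` and `/`, which may need escaping if a value is later embedded in a connection URL. The hex form avoids that at the cost of a longer string.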
|
|
||
| <Warning> | ||
| `PG_PASSWORD` and `MINIO_ROOT_PASSWORD` are written to their volumes on first boot only. Set them before your first `docker compose up`. Changing them after the volume exists locks you out. The full field list is in [Environment variables](/docs/self-hosting/environment). | ||
|
Contributor
True for Postgres, not MinIO. MinIO reads |
||
| </Warning> | ||
|
|
||
| ## Switch the backend to production mode | ||
|
|
||
| Set these runtime flags before going live: | ||
|
|
||
| | Variable | Go-live value | Why | | ||
| |---|---|---| | ||
| | `ENV_TYPE` | `prod` | Disables debug output and runs Django `check --deploy` | | ||
| | `FAST_STARTUP` | `false` | Always applies migrations on restart | | ||
| | `GRANIAN_WORKERS` | your CPU count | One worker per core, up from the default `1` | | ||
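As an illustration of the table above, the three flags could be emitted in `.env` syntax with the worker count derived from the host. The helper names `cpu_count` and `runtime_flags` are hypothetical. The variable names and values are the ones in the table. This is a sketch under the assumption that one worker per visible core is the intended setting.

```bash
#!/usr/bin/env bash
# Sketch only: cpu_count and runtime_flags are hypothetical helper names.

# Number of online CPUs: nproc where available, getconf as a fallback.
cpu_count() {
  if command -v nproc >/dev/null 2>&1; then
    nproc
  else
    getconf _NPROCESSORS_ONLN
  fi
}

# Print the three go-live flags from the table above in .env syntax.
runtime_flags() {
  printf 'ENV_TYPE=prod\n'
  printf 'FAST_STARTUP=false\n'
  printf 'GRANIAN_WORKERS=%s\n' "$(cpu_count)"
}
```

The count reflects the CPUs visible on the machine where the script runs, so it is worth checking by hand if the backend container runs under a CPU limit or on a different host.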
|
|
||
| To tune the gateway, PeerDB, and Temporal workers, see [System configuration](/docs/self-hosting/configuration). | ||
|
|
||
| ## Move to managed data stores | ||
|
|
||
| Compose-managed Postgres, ClickHouse, Redis, and MinIO are fine for a trial. For production, point the stack at managed services so data outlives the containers: | ||
|
|
||
| | Replace | With | Set | | ||
|
Contributor
These don't work from |
||
| |---|---|---| | ||
| | `postgres` | RDS, Aurora, or Cloud SQL | `PG_*` to the managed endpoint | | ||
| | `clickhouse` | ClickHouse Cloud | `CH_HOST`, `CH_PORT`, and credentials | | ||
| | `redis` | ElastiCache or Upstash | `REDIS_URL` | | ||
| | `minio` | AWS S3 | `S3_ENDPOINT_URL` plus AWS credentials | | ||
|
|
||
| <Note> | ||
| `code-executor` needs `privileged: true`, so it can't run on ECS Fargate or Cloud Run. Put it on an EC2 or GCE instance. The platform matrix is in [Requirements](/docs/self-hosting/requirements). | ||
| </Note> | ||
|
|
||
| ## Dive deeper | ||
|
|
||
| <CardGroup cols={2}> | ||
| <Card title="Security & TLS" icon="shield" href="/docs/self-hosting/production/security-tls"> | ||
| Put TLS in front of the stack and move secrets into a manager | ||
| </Card> | ||
| <Card title="Backups & restore" icon="database" href="/docs/self-hosting/production/backups-restore"> | ||
| Set up backups before the instance holds real data | ||
| </Card> | ||
| </CardGroup> | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,46 @@ | ||
| --- | ||
| title: "Monitoring" | ||
| description: "Scrape a self-hosted instance and watch the signals that predict trouble" | ||
| --- | ||
|
|
||
| The backend exposes Prometheus metrics, so any Prometheus-compatible stack can scrape it. This page covers wiring up the scrape and the handful of signals worth alerting on. | ||
|
|
||
| ## Scrape the backend | ||
|
|
||
| The backend serves Prometheus metrics at `http://localhost:8000/metrics`. Add it as a scrape target: | ||
|
Contributor
This endpoint doesn't exist. The backend has no |
||
|
|
||
| ```yaml | ||
| # prometheus.yml | ||
| scrape_configs: | ||
| - job_name: futureagi | ||
| static_configs: | ||
| - targets: ['localhost:8000'] | ||
| metrics_path: /metrics | ||
| ``` | ||
| ## Watch these signals | ||
| | Signal | Why it matters | | ||
| |---|---| | ||
| | Backend error rate | The first sign a release or dependency broke | | ||
| | Temporal workflow success and failure | Failing workflows mean evals and background jobs are stuck | | ||
| | Postgres WAL lag | Rising lag means PeerDB replication is falling behind | | ||
| | ClickHouse query latency | Slow queries surface as a slow dashboard | | ||
| | PeerDB mirror status | The health of the Postgres to ClickHouse pipeline | | ||
| PeerDB has its own console at [localhost:3001](http://localhost:3001) for mirror status and throughput. | ||
| <Note> | ||
| Postgres WAL lag and PeerDB mirror status are the two to page on first. When ClickHouse drifts from Postgres, dashboards read stale before anything visibly breaks. | ||
| </Note> | ||
| ## Dive deeper | ||
| <CardGroup cols={2}> | ||
| <Card title="Upgrades & rollback" icon="arrows-rotate" href="/docs/self-hosting/production/upgrades-rollback"> | ||
| Keep the stack current without downtime | ||
| </Card> | ||
| <Card title="Troubleshooting & FAQs" icon="wrench" href="/docs/self-hosting/troubleshooting"> | ||
| Symptoms, causes, and fixes for common errors | ||
| </Card> | ||
| </CardGroup> | ||
Production folder needs an Overview child pointing at
/docs/self-hosting/production, same as the Explore dashboard folder in Observe. Right now the overview page isn't reachable from the sidebar