28 changes: 21 additions & 7 deletions src/lib/navigation.ts
@@ -60,13 +60,27 @@ export const tabNavigation: NavTab[] = [
title: 'Self-Hosting',
items: [
{ title: 'Overview', href: '/docs/self-hosting' },
{ title: 'System requirements', href: '/docs/self-hosting/requirements' },
{ title: 'Environment variables', href: '/docs/self-hosting/environment' },
{ title: 'Configuration', href: '/docs/self-hosting/configuration' },
{ title: 'Docker Compose', href: '/docs/self-hosting/docker-compose' },
{ title: 'Production', href: '/docs/self-hosting/production' },
{ title: 'User management', href: '/docs/self-hosting/user-management' },
{ title: 'Troubleshooting and FAQs', href: '/docs/self-hosting/troubleshooting' },
{ title: 'Requirements', href: '/docs/self-hosting/requirements' },
{ title: 'Install', href: '/docs/self-hosting/docker-compose' },
{
title: 'Configuration',
items: [
{ title: 'Environment variables', href: '/docs/self-hosting/environment' },
{ title: 'System configuration', href: '/docs/self-hosting/configuration' },
]
},
{
title: 'Production',

Review comment (Contributor):
The Production folder needs an Overview child pointing at /docs/self-hosting/production, same as the Explore dashboard folder in Observe. Right now the overview page isn't reachable from the sidebar.

items: [
{ title: 'Checklist', href: '/docs/self-hosting/production/checklist' },
{ title: 'Security & TLS', href: '/docs/self-hosting/production/security-tls' },
{ title: 'Backups & restore', href: '/docs/self-hosting/production/backups-restore' },
{ title: 'Monitoring', href: '/docs/self-hosting/production/monitoring' },
{ title: 'Upgrades & rollback', href: '/docs/self-hosting/production/upgrades-rollback' },
]
},
{ title: 'Troubleshooting & FAQs', href: '/docs/self-hosting/troubleshooting' },
{ title: 'Support', href: '/docs/self-hosting/support' },
]
},
{
145 changes: 18 additions & 127 deletions src/pages/docs/self-hosting/production.mdx
@@ -1,137 +1,28 @@
---
title: "Production Hardening & Operations"
description: "Production readiness checklist — replace secrets, configure TLS, set up managed data stores, run Postgres/ClickHouse/MinIO backups, and follow the upgrade runbook."
title: "Production"
description: "What to harden before a self-hosted Future AGI instance goes live"
---

## About
Everything past a local trial happens here. The default Docker Compose stack boots with placeholder secrets, no TLS, and compose-managed data stores. That's fine on a laptop. Before real traffic reaches the instance, work through the flow below in order, then keep each page as a runbook.

Run through this before exposing the stack to real users. Covers secrets, TLS, swapping in managed data stores, backup commands for Postgres/ClickHouse/MinIO, Prometheus monitoring, and the upgrade and rollback runbook.
## In this page

## Hardening checklist

**Secrets** — replace all `CHANGEME` values before going live:

```bash
openssl rand -hex 32 # SECRET_KEY, AGENTCC_INTERNAL_API_KEY
openssl rand -base64 24 # PG_PASSWORD, MINIO_ROOT_PASSWORD
```

**Runtime flags** in `.env`:
- `ENV_TYPE=prod`
- `FAST_STARTUP=false`
- `GRANIAN_WORKERS=<your CPU count>`

**TLS** — the frontend and backend don't terminate TLS. Put Caddy, nginx, or Traefik in front:

```
# Caddyfile (simplest — auto-issues Let's Encrypt certs)
app.yourcompany.com { reverse_proxy localhost:3000 }
api.yourcompany.com { reverse_proxy localhost:8000 }
```

After setting up TLS, set `VITE_HOST_API=https://api.yourcompany.com` in `.env` and rebuild:

```bash
docker compose build frontend && docker compose up -d frontend
```

**Managed data stores** — for production, replace compose-managed services:

| Replace | With | Change |
|---|---|---|
| `postgres` | RDS / Aurora / Cloud SQL | Set `PG_*` vars to managed endpoint |
| `clickhouse` | ClickHouse Cloud | Set `CH_HOST`, `CH_PORT`, etc. |
| `redis` | ElastiCache / Upstash | Set `REDIS_URL` |
| `minio` | AWS S3 | Set `S3_ENDPOINT_URL=https://s3.amazonaws.com` + AWS creds |

<Note>
`code-executor` requires `privileged: true`. Run on EC2 / GCE instances — not Fargate or Cloud Run.
</Note>

**Secrets manager** — use AWS Secrets Manager, HashiCorp Vault, or GCP Secret Manager instead of a plain `.env` file.

---

## Backups

### PostgreSQL

```bash
# Backup
docker compose exec postgres \
pg_dump -U futureagi -d futureagi --format=custom \
> backup-$(date +%F).dump

# Restore
docker compose exec -T postgres \
pg_restore -U futureagi -d futureagi --clean --if-exists \
< backup-2026-04-22.dump
```

Volumes: `future-agi_postgres-data` · `future-agi_clickhouse-data` · `future-agi_redis-data` · `future-agi_minio-data` · `future-agi_peerdb-catalog-data` · `future-agi_peerdb-minio-data`

### ClickHouse

```sql
BACKUP TABLE default.traces TO S3('s3://your-bucket/ch-backup/', 'KEY', 'SECRET');
```

ClickHouse data can also be rebuilt from scratch by re-running PeerDB init since it replicates from Postgres.

### MinIO

```bash
mc alias set local http://localhost:9005 futureagi <MINIO_ROOT_PASSWORD>
mc alias set s3 https://s3.amazonaws.com <AWS_KEY> <AWS_SECRET>
mc mirror local/ s3/your-bucket/
```

---

## Monitoring

Backend exposes Prometheus metrics at `http://localhost:8000/metrics`. Add a scraper:

```yaml
# prometheus.yml
scrape_configs:
- job_name: futureagi
static_configs:
- targets: ['localhost:8000']
metrics_path: /metrics
```

Key signals: backend error rate, Temporal workflow success/failure, Postgres WAL lag (PeerDB health), ClickHouse query latency, PeerDB mirror status at [localhost:3001](http://localhost:3001).

---

## Upgrades

```bash
git pull
docker compose build
docker compose up -d
```

Migrations run automatically. If a migration fails: `docker compose exec backend python manage.py migrate`

If release notes mention PeerDB changes: `docker compose run --rm peerdb-init bash /setup.sh`

**Rollback:**

```bash
git log --oneline -5
git checkout <previous-hash>
docker compose build && docker compose up -d
```

## Next Steps
Production readiness for a self-hosted instance breaks into five steps. Do them in order the first time.

<CardGroup cols={2}>
<Card title="Troubleshooting" icon="wrench" href="/docs/self-hosting/troubleshooting">
Symptoms, causes, and fixes for common errors.
<Card title="Checklist" icon="list-check" href="/docs/self-hosting/production/checklist">
The go-live pass: secrets, prod runtime flags, and managed data stores
</Card>
<Card title="Security & TLS" icon="shield" href="/docs/self-hosting/production/security-tls">
Terminate TLS in front of the stack and lock down secrets
</Card>
<Card title="Backups & restore" icon="database" href="/docs/self-hosting/production/backups-restore">
Back up and restore Postgres, ClickHouse, and MinIO
</Card>
<Card title="Monitoring" icon="gauge" href="/docs/self-hosting/production/monitoring">
Scrape Prometheus metrics and watch the signals that matter
</Card>
<Card title="System Configuration" icon="sliders" href="/docs/self-hosting/configuration">
Tune the LLM gateway, PeerDB mirrors, and Temporal workers.
<Card title="Upgrades & rollback" icon="arrows-rotate" href="/docs/self-hosting/production/upgrades-rollback">
Pull a release, run migrations, and roll back safely
</Card>
</CardGroup>
68 changes: 68 additions & 0 deletions src/pages/docs/self-hosting/production/backups-restore.mdx
Original file line number Diff line number Diff line change
@@ -0,0 +1,68 @@
---
title: "Backups & restore"
description: "Back up and restore the data stores behind a self-hosted instance"
---

A self-hosted instance keeps state in four stores: Postgres for application data, ClickHouse for observability records, MinIO for object storage, and Redis for cache and queues. This page covers backing up and restoring the three that hold durable data. Redis is a cache and doesn't need a backup.

Review comment (Contributor):
'Cache and queues' at the start, 'a cache' at the end. If it holds queues, losing it drops queued work. Either say it's safe to lose because it rebuilds, or drop 'queues'


## Postgres

Postgres holds the application data, so back it up on a schedule. Use the custom format so restores can run selectively:

```bash
# Backup
docker compose exec postgres \

Review comment (Contributor):
Add -T on the backup too. compose exec allocates a tty by default and it mangles the binary dump

pg_dump -U futureagi -d futureagi --format=custom \
> backup-$(date +%F).dump

# Restore
docker compose exec -T postgres \
pg_restore -U futureagi -d futureagi --clean --if-exists \
< backup-2026-04-22.dump
```
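
Following the review note on the backup command, here is a sketch of the same dump with `-T` added. The `backup_postgres` wrapper is hypothetical and exists only so that nothing runs until it is called. The user, database, and file name are the ones this page already uses.

```shell
# Hypothetical wrapper, not part of the PR: nothing runs until it is called.
backup_postgres() {
  # -T disables pseudo-TTY allocation, so the custom-format dump
  # reaches the redirect without terminal translation.
  docker compose exec -T postgres \
    pg_dump -U futureagi -d futureagi --format=custom \
    > "backup-$(date +%F).dump"
}
```

Calling `backup_postgres` from the directory that holds the compose file should leave a dated `.dump` file that the restore command above can read.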

The Docker volumes that hold state:

| Volume | Holds |
|---|---|
| `future-agi_postgres-data` | Postgres application data |

Review comment (Contributor):
Wrong prefix. The compose file sets name: futureagi, so the volumes are futureagi_postgres-data etc. Also missing rabbitmq-data and fi-collector-data if this is the volumes-that-hold-state table

| `future-agi_clickhouse-data` | ClickHouse records |
| `future-agi_redis-data` | Redis cache |
| `future-agi_minio-data` | MinIO objects |
| `future-agi_peerdb-catalog-data` | PeerDB replication catalog |
| `future-agi_peerdb-minio-data` | PeerDB staging objects |
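
Compose names volumes `<project>_<volume>`, so the prefix questioned in the review comment can be checked on a running host rather than read off the page. A minimal sketch, assuming the project name is `futureagi` as the comment states; the `list_state_volumes` helper is hypothetical.

```shell
# Hypothetical helper, not part of the PR.
# Lists volume names that start with the Compose project prefix.
list_state_volumes() {
  project="${1:-futureagi}"   # assumed project name, per the review comment
  docker volume ls --filter "name=${project}_" --format '{{.Name}}'
}
```

Passing a different project name as the first argument checks another prefix, for example `list_state_volumes future-agi`.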

## ClickHouse

ClickHouse can back up straight to S3:

```sql
BACKUP TABLE default.traces TO S3('s3://your-bucket/ch-backup/', 'KEY', 'SECRET');
```

<Note>
ClickHouse is a replica of Postgres data, streamed through PeerDB. If you lose it, rebuild it from scratch by re-running PeerDB init rather than restoring a backup. The steps are in [Upgrades & rollback](/docs/self-hosting/production/upgrades-rollback).

Review comment (Contributor):
This isn't true anymore. With the CH25 cutover (CH25_DROP_LEGACY_CDC_CHAIN defaults to true in compose), fi-collector writes spans straight to ClickHouse and Django dual-writes traces. None of that comes back from PeerDB init, so losing ClickHouse loses the observability data. Flip this note: ClickHouse needs a real backup, and the command above should cover spans too (or just BACKUP DATABASE default)

</Note>

## MinIO

Mirror the MinIO bucket to S3 with the MinIO client:

```bash
mc alias set local http://localhost:9005 futureagi <MINIO_ROOT_PASSWORD>
mc alias set s3 https://s3.amazonaws.com <AWS_KEY> <AWS_SECRET>
mc mirror local/ s3/your-bucket/
```

If you've already moved to [managed data stores](/docs/self-hosting/production/checklist), your provider's own backup tooling replaces these commands.

## Dive deeper

<CardGroup cols={2}>
<Card title="Monitoring" icon="gauge" href="/docs/self-hosting/production/monitoring">
Watch store health and replication lag
</Card>
<Card title="Upgrades & rollback" icon="arrows-rotate" href="/docs/self-hosting/production/upgrades-rollback">
Rebuild ClickHouse and roll back releases
</Card>
</CardGroup>
57 changes: 57 additions & 0 deletions src/pages/docs/self-hosting/production/checklist.mdx
Original file line number Diff line number Diff line change
@@ -0,0 +1,57 @@
---
title: "Checklist"
description: "The go-live pass before a self-hosted instance takes real traffic"
---

Run through this once before the stack is reachable by anyone else. It covers the three things that separate a laptop trial from a real deployment: replacing the shipped secrets, switching the backend into production mode, and swapping compose-managed data stores for managed ones.

Review comment (Contributor):
Make the three things bullet points


## Replace the shipped secrets

The stack boots with `CHANGEME` placeholders. Replace every one before the instance is reachable, and generate each value rather than making one up:

Review comment (Contributor):
There's no CHANGEME in the repo. The compose defaults are local-dev-only-...-replace-me (and futureagi for the passwords), and the real guard is deploy/docker-compose.production.yml, which re-binds these with ${VAR:?} so prod refuses to boot until they're set. Rewrite this section around that overlay. Its required list is also longer than these four: SECRET_KEY, AGENTCC_INTERNAL_API_KEY, AGENTCC_ADMIN_TOKEN, PG_PASSWORD, MINIO_ROOT_PASSWORD, RABBITMQ_USER/PASSWORD, FRONTEND_URL
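
The `${VAR:?}` form in a Compose file fails interpolation when the variable is unset or empty, the same rule as the shell expansion it is modeled on. A pre-flight check in that spirit might look like the sketch below. The `check_required_env` helper is hypothetical, the variable list is copied from the comment above, and reading `RABBITMQ_USER/PASSWORD` as `RABBITMQ_USER` plus `RABBITMQ_PASSWORD` is an assumption.

```shell
# Hypothetical pre-flight check, not part of the repo.
# Treats unset and empty the same way, matching the ${VAR:?} rule.
check_required_env() {
  missing=0
  for name in SECRET_KEY AGENTCC_INTERNAL_API_KEY AGENTCC_ADMIN_TOKEN \
              PG_PASSWORD MINIO_ROOT_PASSWORD \
              RABBITMQ_USER RABBITMQ_PASSWORD FRONTEND_URL; do
    # Indirect lookup of the variable named in $name (names are fixed above).
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

Running `check_required_env` after loading the environment reports every missing name in one pass, where the overlay itself stops at the first one.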


```bash
openssl rand -hex 32 # SECRET_KEY, AGENTCC_INTERNAL_API_KEY
openssl rand -base64 24 # PG_PASSWORD, MINIO_ROOT_PASSWORD
```

<Warning>
`PG_PASSWORD` and `MINIO_ROOT_PASSWORD` are written to their volumes on first boot only. Set them before your first `docker compose up`. Changing them after the volume exists locks you out. The full field list is in [Environment variables](/docs/self-hosting/environment).

Review comment (Contributor):
True for Postgres, not MinIO. MinIO reads MINIO_ROOT_PASSWORD from env on every boot, so changing it and restarting just works. Scope this to PG_PASSWORD

</Warning>

## Switch the backend to production mode

Set these runtime flags before going live:

| Variable | Go-live value | Why |
|---|---|---|
| `ENV_TYPE` | `prod` | Disables debug output and runs Django `check --deploy` |
| `FAST_STARTUP` | `false` | Always applies migrations on restart |
| `GRANIAN_WORKERS` | your CPU count | One worker per core, up from the default `1` |

To tune the gateway, PeerDB, and Temporal workers, see [System configuration](/docs/self-hosting/configuration).

## Move to managed data stores

Compose-managed Postgres, ClickHouse, Redis, and MinIO are fine for a trial. For production, point the stack at managed services so data outlives the containers:

| Replace | With | Set |

Review comment (Contributor):
These don't work from .env. CH_HOST/CH_PORT, REDIS_URL, S3_ENDPOINT_URL (and PG_HOST) are hardcoded in the backend env block of docker-compose.yml, so setting them in .env does nothing. You have to edit the compose file, and the page should say so. Also for S3 the actual switch is STORAGE_BACKEND=s3 per the compose comments, not the endpoint URL

|---|---|---|
| `postgres` | RDS, Aurora, or Cloud SQL | `PG_*` to the managed endpoint |
| `clickhouse` | ClickHouse Cloud | `CH_HOST`, `CH_PORT`, and credentials |
| `redis` | ElastiCache or Upstash | `REDIS_URL` |
| `minio` | AWS S3 | `S3_ENDPOINT_URL` plus AWS credentials |

<Note>
`code-executor` needs `privileged: true`, so it can't run on ECS Fargate or Cloud Run. Put it on an EC2 or GCE instance. The platform matrix is in [Requirements](/docs/self-hosting/requirements).
</Note>

## Dive deeper

<CardGroup cols={2}>
<Card title="Security & TLS" icon="shield" href="/docs/self-hosting/production/security-tls">
Put TLS in front of the stack and move secrets into a manager
</Card>
<Card title="Backups & restore" icon="database" href="/docs/self-hosting/production/backups-restore">
Set up backups before the instance holds real data
</Card>
</CardGroup>
46 changes: 46 additions & 0 deletions src/pages/docs/self-hosting/production/monitoring.mdx
Original file line number Diff line number Diff line change
@@ -0,0 +1,46 @@
---
title: "Monitoring"
description: "Scrape a self-hosted instance and watch the signals that predict trouble"
---

The backend exposes Prometheus metrics, so any Prometheus-compatible stack can scrape it. This page covers wiring up the scrape and the handful of signals worth alerting on.

## Scrape the backend

The backend serves Prometheus metrics at `http://localhost:8000/metrics`. Add it as a scrape target:

Review comment (Contributor):
This endpoint doesn't exist. The backend has no /metrics route (nothing in tfc/urls.py, and granian isn't started with metrics), and fi-collector's admin port (9464) only serves /healthz; its README still lists the metrics exporter as a TODO. There's nothing to scrape yet. Rework the page around what actually exists (/healthz checks, the PeerDB UI on 3001, container-level monitoring), or hold it until a metrics endpoint lands
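
A health probe built on what the comment says exists today might look like the sketch below. The `check_healthz` helper is hypothetical, and it assumes fi-collector's admin port 9464 is reachable from wherever the probe runs, which the comment does not confirm.

```shell
# Hypothetical probe, not part of the PR.
# Assumes the fi-collector admin port (9464, per the comment above) is reachable.
check_healthz() {
  base_url="${1:-http://localhost:9464}"
  # -f makes curl exit non-zero on HTTP errors; --max-time bounds a hung probe.
  if curl -fsS --max-time 5 "$base_url/healthz" >/dev/null; then
    echo "healthy: $base_url"
  else
    echo "unhealthy: $base_url" >&2
    return 1
  fi
}
```

The non-zero return on failure lets the function slot into a cron job or a container healthcheck without extra parsing.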


```yaml
# prometheus.yml
scrape_configs:
- job_name: futureagi
static_configs:
- targets: ['localhost:8000']
metrics_path: /metrics
```

## Watch these signals

| Signal | Why it matters |
|---|---|
| Backend error rate | The first sign a release or dependency broke |
| Temporal workflow success and failure | Failing workflows mean evals and background jobs are stuck |
| Postgres WAL lag | Rising lag means PeerDB replication is falling behind |
| ClickHouse query latency | Slow queries surface as a slow dashboard |
| PeerDB mirror status | The health of the Postgres to ClickHouse pipeline |

PeerDB has its own console at [localhost:3001](http://localhost:3001) for mirror status and throughput.

<Note>
Postgres WAL lag and PeerDB mirror status are the two to page on first. When ClickHouse drifts from Postgres, dashboards read stale before anything visibly breaks.
</Note>

## Dive deeper

<CardGroup cols={2}>
<Card title="Upgrades & rollback" icon="arrows-rotate" href="/docs/self-hosting/production/upgrades-rollback">
Keep the stack current without downtime
</Card>
<Card title="Troubleshooting & FAQs" icon="wrench" href="/docs/self-hosting/troubleshooting">
Symptoms, causes, and fixes for common errors
</Card>
</CardGroup>