ImageForge

Standalone, reusable local AI image-generation service. After the one-time model download, it runs fully offline on Apple Silicon (MPS) using Stable Diffusion via diffusers. No paid APIs and no cloud calls: generation happens entirely on your machine.

Target hardware: M2 Pro, 16 GB unified memory (tested).

Two connection surfaces

Surface	Transport	Use case
HTTP API (FastAPI)	TCP on port 8765	Any app, curl, browser, Python SDK
MCP server	stdio	AI agents — Claude Code, scripts, orchestrators

Both surfaces share the same process-level singletons: the SD engine (load-once weights), the prompt-assist client (gemma4/Ollama), and the content-addressed cache.
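As a rough illustration of what "load-once" means here, the sketch below caches one engine object per model key so that every caller in the process gets the same instance. The `Engine` class and `get_engine` function are hypothetical stand-ins, not the project's actual code.

```python
from functools import lru_cache


class Engine:
    """Hypothetical stand-in for the SD engine; the real one would load weights here."""

    def __init__(self, model_key: str) -> None:
        self.model_key = model_key


@lru_cache(maxsize=None)
def get_engine(model_key: str) -> Engine:
    # Constructed on the first call for a given key; later calls with the
    # same key return the cached instance instead of loading again.
    return Engine(model_key)
```

Under this pattern an HTTP handler and an MCP tool that both call `get_engine("sd-turbo")` would share one set of weights.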

Quick start

Fastest path — one command

cd ~/Developer/imageforge
./start.sh            # bootstrap deps into the venv, seed .env, launch HTTP API

start.sh verifies the SD venv, installs the light HTTP/MCP layer if anything is missing, seeds .env from .env.example on first run, reports whether the heavy SD stack is present, then launches. Other modes:

./start.sh api --reload   # HTTP API with hot-reload
./start.sh mcp            # MCP stdio server
./start.sh selftest       # MCP self-test (fast, no GPU)
./start.sh check          # bootstrap + dependency report, then exit

The manual steps below are the detailed equivalent if you prefer to run each part yourself (or are not on a Mac — see also the CPU-only Dockerfile).

1. Install dependencies

All SD components run under the existing venv at ~/.floor-voice-studio/venv-sd. Install the ImageForge HTTP/MCP layer into it:

cd ~/Developer/imageforge

arch -arm64 $HOME/.floor-voice-studio/venv-sd/bin/pip install \
    fastapi "uvicorn[standard]" mcp pydantic pydantic-settings httpx

Pre-installed in the venv: diffusers, torch, transformers, Pillow, accelerate.

2. Configure

cp .env.example .env
# edit .env as needed — defaults work for M2 Pro 16 GB

3. Launch the HTTP API

./run_api.sh
# with hot-reload during development:
./run_api.sh --reload

4. Launch the MCP server

./run_mcp.sh
# or directly:
arch -arm64 $HOME/.floor-voice-studio/venv-sd/bin/python -m imageforge.mcp_server

5. (Optional) Run in Docker — CPU-only, for Linux / CI

On a Mac, prefer ./start.sh (native MPS). Docker on macOS has no MPS passthrough, so the container runs the engine CPU-only (slow — minutes per image). It exists for portability / Linux servers / CI smoke tests:

docker build -t imageforge .
docker run --rm -p 8765:8765 \
  -v "$HOME/.cache/huggingface:/root/.cache/huggingface" \
  -v "$(pwd)/outputs:/app/outputs" imageforge
curl http://127.0.0.1:8765/health

Configuration

Variable	Default	Description
`IMAGEFORGE_MODEL`	`sd-turbo`	Active model key (see model table below)
`DEVICE`	`mps`	`mps` / `cuda` / `cpu`
`PORT`	`8765`	HTTP API port
`HOST`	`127.0.0.1`	HTTP API bind address
`OLLAMA_URL`	`http://127.0.0.1:11434`	Ollama base URL for prompt-assist
`OLLAMA_MODEL`	`gemma4:latest`	Model used for prompt expansion
`OUTPUT_DIR`	`outputs/`	Where generated images land on disk
`CACHE_DIR`	`cache/`	Content-addressed cache directory
`MAX_RES`	`1024`	Maximum allowed image dimension
`ALLOW_FLUX`	`0`	Set `1` to unlock the guarded flux-q4 model
`WARMUP`	`0`	Set `1` to preload SD weights at API startup
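The table above can be read as a set of environment variables with defaults. The project layout lists a pydantic-settings singleton for this; the sketch below is only a standard-library approximation, and the `Settings` dataclass and `load_settings` function are illustrative names, not the project's API.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    model: str
    device: str
    host: str
    port: int
    max_res: int
    allow_flux: bool
    warmup: bool


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read the documented variables, falling back to the documented defaults."""
    return Settings(
        model=env.get("IMAGEFORGE_MODEL", "sd-turbo"),
        device=env.get("DEVICE", "mps"),
        host=env.get("HOST", "127.0.0.1"),
        port=int(env.get("PORT", "8765")),
        max_res=int(env.get("MAX_RES", "1024")),
        # Assumption: the flags are treated as enabled only when set to "1".
        allow_flux=env.get("ALLOW_FLUX", "0") == "1",
        warmup=env.get("WARMUP", "0") == "1",
    )
```

Passing an explicit mapping, for example `load_settings({"PORT": "9000"})`, makes the defaults easy to check without touching the real environment.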

Model policy table

Model key	Steps	Default res	Est. RAM	Est. time (M2 Pro)	Operations	Guard	Quality
`sd-turbo`	1–2	512 px	~4 GB	~2–3 s	text2img, img2img, inpaint	—	Draft
`sdxl-turbo`	4	512 px (up to 1024)	~7 GB	~8 s	text2img, img2img, inpaint	—	Good
`sdxl`	30	1024 px	~7 GB	~40 s	text2img, img2img, inpaint	—	High
`sdxl-lcm`	6	1024 px	~9 GB	~20 s	text2img, img2img, inpaint	—	Best (non-FLUX)
`sdxl-portrait`	30	1024 px	~8 GB	~40 s	text2img, img2img, inpaint	—	High + portrait LoRA
`flux-q4`	4	1024 px	~8 GB active / 16–18 GB peak	~45 s	text2img only	`ALLOW_FLUX=1`	Highest
`flux2-klein`	4	1024 px	~10 GB (int8) / 13–16 GB (bf16)	~35–55 s	text2img only	`ALLOW_KLEIN=1`	Frontier

Why sd-turbo is the default: 1–2-step inference at roughly 2–3 s per image, and safe on any machine with 8 GB+ of memory. Use sdxl-lcm when quality matters and time allows.

sdxl vs sdxl-portrait: sdxl is plain SDXL base-1.0 (30 steps, 1024 px, no LoRA). sdxl-portrait is the same checkpoint with a fused portrait LoRA — it requires IMAGEFORGE_LORA_PORTRAIT=/path/to/lora.safetensors to be set; without it, it behaves like plain sdxl.

FLUX note: flux-q4 peaks at 16–18 GB due to GGUF loader overhead — this causes macOS to swap on a 16 GB machine. Enable only with ALLOW_FLUX=1 and accept the trade-off, or use a machine with 24 GB+.

FLUX.2-klein note: flux2-klein uses a new architecture (Qwen3-4B encoder, Flux2KleinPipeline) distinct from FLUX.1. With Quanto int8 quantisation the transformer fits in ~9.4–10.4 GB; plain bf16 peaks at 13–16 GB (RED on 16 GB machines). The Phase-2 loader (kind="flux2-klein") has shipped: klein_loader.load_flux2_klein is wired via pipeline.py's _load_flux2_klein behind ALLOW_KLEIN=1. The loader is no longer the blocker; loading now fails only if the weights cannot be downloaded (no network or missing HF checkpoint) or if diffusers is older than 0.38 (which lacks Flux2KleinPipeline). Enable with ALLOW_KLEIN=1.
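The guard column in the model table amounts to a lookup from model key to an opt-in environment variable. A minimal sketch of that check follows; `GUARDS` and `guard_reason` are hypothetical names, and the wording of the returned message is an assumption rather than the service's actual error text.

```python
from typing import Mapping, Optional

# Guard variable per model key, taken from the model policy table.
GUARDS = {"flux-q4": "ALLOW_FLUX", "flux2-klein": "ALLOW_KLEIN"}


def guard_reason(model_key: str, env: Mapping[str, str]) -> Optional[str]:
    """Return None if the model may load, else a short reason string."""
    var = GUARDS.get(model_key)
    if var is None or env.get(var) == "1":
        return None  # unguarded model, or the guard was explicitly lifted
    return f"Model '{model_key}' is guarded: set {var}=1 to enable"
```

A caller would check `guard_reason(model, os.environ)` before loading weights and refuse to proceed when it returns a string.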

HTTP API

Endpoint reference

`GET /health`

Probe MPS availability and the Ollama endpoint.

curl http://127.0.0.1:8765/health

{
  "status": "ok",
  "device": "mps",
  "mps_available": true,
  "engine_constructed": true,
  "resident_models": ["sd-turbo"],
  "default_model": "sd-turbo",
  "ollama": {
    "reachable": true,
    "url": "http://127.0.0.1:11434",
    "model": "gemma4:latest",
    "model_available": true
  }
}
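A client can turn that payload into a short list of actionable problems. The sketch below works on the already-parsed JSON body, so it assumes only the field names shown in the sample response; `health_problems` is an illustrative helper, not part of ImageForge.

```python
from typing import Any, Dict, List


def health_problems(payload: Dict[str, Any]) -> List[str]:
    """List human-readable problems found in a parsed /health response body."""
    problems: List[str] = []
    if payload.get("status") != "ok":
        problems.append(f"status is {payload.get('status')!r}")
    if not payload.get("mps_available", False):
        problems.append("MPS is not available")
    ollama = payload.get("ollama", {})
    if not ollama.get("reachable", False):
        problems.append("Ollama is unreachable")
    elif not ollama.get("model_available", False):
        problems.append(f"Ollama model {ollama.get('model')!r} is not available")
    return problems
```

An empty list means the sample fields all look healthy; anything else can be printed before the first generation request.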

`GET /models`

List all registered models and their capabilities.

curl http://127.0.0.1:8765/models

`POST /generate`

Text-to-image. Returns a base64 PNG plus generation metadata.

curl -X POST http://127.0.0.1:8765/generate \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "a red fox sitting in autumn leaves, golden hour, cinematic",
    "seed": 42,
    "width": 512,
    "height": 512,
    "return_mode": "both"
  }'

With prompt-assist (gemma4 expands the short request):

curl -X POST http://127.0.0.1:8765/generate \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "a red fox",
    "assist_prompt": true,
    "model": "sdxl-turbo",
    "seed": 7
  }'

Key request fields:

Field	Type	Default	Description
`prompt`	string	required	Image description
`negative_prompt`	string	null	What to avoid
`model`	string	`sd-turbo`	Model key
`steps`	int	model default	Inference steps
`guidance`	float	model default	CFG scale (turbo models use 0)
`width` / `height`	int	model default	Must be multiples of 8
`seed`	int	random	Reproducibility seed
`assist_prompt`	bool	false	Expand via gemma4 first
`use_cache`	bool	true	Return cached result for identical seeded params
`return_mode`	string	`"both"`	`"base64"` / `"path"` / `"both"`
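Some of these constraints can be checked on the client before a request is sent. The helper below is a sketch under the assumptions stated in the tables (dimensions are multiples of 8, the default `MAX_RES` is 1024, and `return_mode` takes one of three values); `build_generate_request` is a hypothetical name, and the server remains the authority on validation.

```python
from typing import Any, Dict, Optional

RETURN_MODES = {"base64", "path", "both"}


def build_generate_request(
    prompt: str,
    width: Optional[int] = None,
    height: Optional[int] = None,
    seed: Optional[int] = None,
    return_mode: str = "both",
    max_res: int = 1024,
) -> Dict[str, Any]:
    """Build a /generate JSON body, rejecting obviously invalid values early."""
    if not prompt:
        raise ValueError("prompt is required")
    if return_mode not in RETURN_MODES:
        raise ValueError(f"return_mode must be one of {sorted(RETURN_MODES)}")
    body: Dict[str, Any] = {"prompt": prompt, "return_mode": return_mode}
    for name, value in (("width", width), ("height", height)):
        if value is None:
            continue  # omit the field so the server applies the model default
        if value <= 0 or value % 8 != 0:
            raise ValueError(f"{name} must be a positive multiple of 8, got {value}")
        if value > max_res:
            raise ValueError(f"{name} exceeds the maximum of {max_res}")
        body[name] = value
    if seed is not None:
        body["seed"] = seed
    return body
```

The returned dictionary can be passed directly as the `json=` argument of an HTTP client call.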

Response shape:

{
  "ok": true,
  "task": "text2img",
  "model": "sd-turbo",
  "seed": 42,
  "steps": 2,
  "guidance": 0.0,
  "width": 512,
  "height": 512,
  "elapsed_s": 2.871,
  "prompt": "a red fox sitting in autumn leaves, golden hour, cinematic",
  "assisted": false,
  "cached": false,
  "output_path": "$(pwd)/outputs/20260610_075828_521b48.png",
  "image_base64": "<base64 PNG data>",
  "mime": "image/png"
}
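When the inline image is needed rather than the file at `output_path`, the `image_base64` field can be decoded and written out. The sketch below relies only on the field names in the response above; `save_image` is an illustrative helper, and the PNG signature check is an extra sanity step that is not part of the API.

```python
import base64
import pathlib
from typing import Any, Dict

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def save_image(response: Dict[str, Any], dest: pathlib.Path) -> pathlib.Path:
    """Decode image_base64 from a parsed response and write it to dest."""
    if not response.get("ok", False):
        raise ValueError("response does not report ok=true")
    data = base64.b64decode(response["image_base64"])
    # Sanity check: a PNG payload must start with the 8-byte PNG signature.
    if response.get("mime") == "image/png" and not data.startswith(PNG_SIGNATURE):
        raise ValueError("payload is not a PNG")
    dest.write_bytes(data)
    return dest
```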

`POST /edit`

Image-to-image. Alter an existing image guided by a prompt. Supply the source image as base64 (init_image) or a local path (init_image_path).

# Path-based (simpler for local consumers).
# The '"$(pwd)"' splice briefly closes the single-quoted JSON so the shell can expand the path.
curl -X POST http://127.0.0.1:8765/edit \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "same fox but in a snowy winter forest",
    "init_image_path": "'"$(pwd)"'/outputs/selftest.png",
    "strength": 0.65,
    "seed": 100
  }'

strength: 0.0 = keep the original, 1.0 = full regeneration. Default 0.6.

`POST /inpaint`

Regenerate only the white regions of a mask within the source image.

curl -X POST http://127.0.0.1:8765/inpaint \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "a bright orange pumpkin lantern",
    "init_image_path": "/path/to/source.png",
    "mask_image_path": "/path/to/mask.png",
    "strength": 0.85
  }'

Mask convention: white pixels = repaint, black pixels = keep.
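To make the convention concrete, the sketch below writes a rectangular mask as an 8-bit grayscale PNG using only the standard library: white inside the box, black everywhere else. In practice Pillow (already in the venv) is the more convenient tool. Whether the service prefers grayscale or RGB masks is not stated here, so treat the pixel format as an assumption; `rect_mask_png` is a hypothetical helper.

```python
import struct
import zlib
from typing import Tuple


def _chunk(kind: bytes, data: bytes) -> bytes:
    """Frame one PNG chunk: length, type, data, then CRC over type + data."""
    crc = zlib.crc32(kind + data) & 0xFFFFFFFF
    return struct.pack(">I", len(data)) + kind + data + struct.pack(">I", crc)


def rect_mask_png(width: int, height: int, box: Tuple[int, int, int, int]) -> bytes:
    """Return PNG bytes: white inside box (repaint), black outside (keep).

    box is (left, top, right, bottom) with right and bottom exclusive.
    """
    left, top, right, bottom = box
    rows = bytearray()
    for y in range(height):
        rows.append(0)  # filter type 0 (None) precedes every scanline
        for x in range(width):
            inside = left <= x < right and top <= y < bottom
            rows.append(255 if inside else 0)
    # IHDR: width, height, bit depth 8, colour type 0 (grayscale), no interlace.
    header = struct.pack(">IIBBBBB", width, height, 8, 0, 0, 0, 0)
    return (
        b"\x89PNG\r\n\x1a\n"
        + _chunk(b"IHDR", header)
        + _chunk(b"IDAT", zlib.compress(bytes(rows)))
        + _chunk(b"IEND", b"")
    )
```

Writing the result with `pathlib.Path("mask.png").write_bytes(...)` gives a file that can be passed as `mask_image_path`, provided its dimensions match the source image.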

`GET /docs`

Interactive OpenAPI docs (Swagger UI) — open http://127.0.0.1:8765/docs in a browser.

Python `httpx` snippet (voice-studio consumer pattern)

import base64
import pathlib

import httpx

IMAGEFORGE = "http://127.0.0.1:8765"

async def generate_avatar(prompt: str, seed: int = 42) -> pathlib.Path:
    """Generate an avatar image and return the local path."""
    async with httpx.AsyncClient(timeout=120) as client:
        r = await client.post(
            f"{IMAGEFORGE}/generate",
            json={
                "prompt": prompt,
                "model": "sd-turbo",
                "seed": seed,
                "width": 512,
                "height": 512,
                "return_mode": "both",
                "assist_prompt": False,
            },
        )
        r.raise_for_status()
        data = r.json()

    # The output file is already on disk at data["output_path"].
    # Optionally decode the inline base64 if you need the bytes in-process:
    png_bytes = base64.b64decode(data["image_base64"])

    out = pathlib.Path(data["output_path"])
    print(f"avatar generated: {out}  ({data['elapsed_s']:.1f}s, seed={data['seed']})")
    return out


async def edit_avatar(source_path: str, prompt: str) -> pathlib.Path:
    """Alter an existing avatar."""
    async with httpx.AsyncClient(timeout=120) as client:
        r = await client.post(
            f"{IMAGEFORGE}/edit",
            json={
                "prompt": prompt,
                "init_image_path": source_path,
                "strength": 0.65,
                "return_mode": "both",
            },
        )
        r.raise_for_status()
        data = r.json()
    return pathlib.Path(data["output_path"])


# Synchronous wrapper for non-async callers:
import asyncio

def generate_avatar_sync(prompt: str, seed: int = 42) -> pathlib.Path:
    return asyncio.run(generate_avatar(prompt, seed))

MCP server

The MCP server exposes ImageForge as Model-Context-Protocol tools over stdio. Logs go to stderr; stdout is the protocol channel. It is identical to the HTTP surface in capability — it shares the same engine, cache, and prompt-assist singletons.

Claude Code registration

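Add to your project's .mcp.json (or to ~/.claude/settings.json under "mcpServers"). MCP clients typically launch the command directly rather than through a shell, so $HOME and ~ in the examples below may not be expanded; if the server fails to start, substitute absolute paths: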

{
  "mcpServers": {
    "imageforge": {
      "command": "arch",
      "args": [
        "-arm64",
        "$HOME/.floor-voice-studio/venv-sd/bin/python",
        "-m",
        "imageforge.mcp_server"
      ],
      "cwd": "~/Developer/imageforge"
    }
  }
}

Alternative (via run_mcp.sh):

{
  "mcpServers": {
    "imageforge": {
      "command": "~/Developer/imageforge/run_mcp.sh",
      "args": [],
      "cwd": "~/Developer/imageforge"
    }
  }
}

MCP tool list

Tool	Required args	Optional args	Returns
`generate_image`	`prompt`	`model`, `steps`, `guidance`, `width`, `height`, `seed`, `negative_prompt`, `assist`	`ImageContent` (base64 PNG) + `TextContent` (JSON metadata)
`edit_image`	`prompt`, `image_path`	`strength`, `model`, `steps`, `guidance`, `seed`, `negative_prompt`, `assist`	Edited PNG + JSON metadata
`inpaint_image`	`prompt`, `image_path`, `mask_path`	`strength`, `model`, `steps`, `guidance`, `seed`, `negative_prompt`, `assist`	Result PNG + JSON metadata
`assist_prompt`	`prompt`	`style`	`TextContent` — expanded prompt + tags
`list_models`	—	—	`TextContent` — JSON model registry

Image paths for edit_image / inpaint_image: pass absolute paths or project-relative paths (e.g. outputs/foo.png). The server resolves relative paths against the project root.

assist=true on generation tools calls gemma4/Ollama to expand a short request into a richer SD prompt. If Ollama is unavailable, the tool falls back transparently to the original prompt.
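The fallback behaviour can be pictured as a thin wrapper around whatever function talks to Ollama. In the sketch below, `expand` is a caller-supplied callable standing in for that client. The "passthrough" source value mirrors the self-test output, while the "ollama" value and the function name `assist_or_passthrough` are assumptions made for illustration.

```python
from typing import Any, Callable, Dict


def assist_or_passthrough(prompt: str, expand: Callable[[str], str]) -> Dict[str, Any]:
    """Try to expand the prompt; on any failure return it unchanged."""
    passthrough = {"prompt": prompt, "assisted": False, "source": "passthrough"}
    try:
        expanded = expand(prompt)
    except Exception:
        # Connection refused, timeout, bad response: generation should still proceed.
        return passthrough
    if not expanded or not expanded.strip():
        return passthrough  # an empty expansion is treated as a failure
    return {"prompt": expanded.strip(), "assisted": True, "source": "ollama"}
```

The caller can always use the returned `prompt` field and only needs to inspect `assisted` for reporting.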

Example tool call (Claude Code session)

When imageforge is registered, Claude Code can call tools like:

<use_mcp_tool>
  <server_name>imageforge</server_name>
  <tool_name>generate_image</tool_name>
  <arguments>
    {
      "prompt": "friendly robot avatar, flat design, vibrant colors",
      "model": "sd-turbo",
      "seed": 99,
      "width": 512,
      "height": 512,
      "assist": true
    }
  </arguments>
</use_mcp_tool>

The response includes the image inline as ImageContent (rendered in-chat) plus a TextContent block with the metadata JSON:

{
  "task": "text2img",
  "model": "sd-turbo",
  "seed": 99,
  "steps": 2,
  "guidance": 0.0,
  "width": 512,
  "height": 512,
  "elapsed_s": 2.91,
  "prompt": "friendly robot avatar, ...",
  "output_path": "$(pwd)/outputs/20260610_XXXXXX_XXXXXX.png"
}

MCP self-test (no MCP client needed)

# Handshake + tool discovery + list_models + assist (no GPU):
arch -arm64 $HOME/.floor-voice-studio/venv-sd/bin/python \
    -m imageforge.mcp_server --selftest

# Also run a real generation through the MCP path (loads SD, ~15 s):
arch -arm64 $HOME/.floor-voice-studio/venv-sd/bin/python \
    -m imageforge.mcp_server --selftest --gen

Expected output:

[selftest] initialize ok: imageforge v0.1.0
[selftest] list_tools -> ['generate_image', 'edit_image', 'inpaint_image', 'assist_prompt', 'list_models']
[selftest] list_models ok (sd-turbo present)
[selftest] assist_prompt ok: {"prompt": ..., "assisted": false, "source": "passthrough"}
[selftest] generate_image (this loads SD; may take a while)...
[selftest] generate_image -> content kinds: ['image', 'text']
[selftest] generate_image ok: 24871 bytes PNG, mime=image/png
[selftest] ALL PASS

Unit tests

# MCP layer tests:
arch -arm64 $HOME/.floor-voice-studio/venv-sd/bin/python -m unittest tests.test_mcp

# Include the real SD generation test (loads diffusers, ~15 s):
IMAGEFORGE_TEST_GEN=1 arch -arm64 $HOME/.floor-voice-studio/venv-sd/bin/python \
    -m unittest tests.test_mcp

# All tests:
arch -arm64 $HOME/.floor-voice-studio/venv-sd/bin/python -m pytest tests/

Proven end-to-end result

A real smoke test was run on 2026-06-10 (M2 Pro, 16 GB, macOS):

Smoke test output (outputs/20260610_075828_521b48.json):

{
  "saved_at_iso": "2026-06-10T07:58:28Z",
  "filename": "20260610_075828_521b48.png",
  "seed": 42,
  "model": "sd-turbo",
  "task": "text2img",
  "steps": 2,
  "guidance": 0.0,
  "width": 256,
  "height": 256,
  "elapsed_s": 15.432,
  "prompt": "a red fox in autumn leaves"
}

Three generated images, one edited image, and one inpainted image are present in outputs/ as proof. The MCP self-test (--selftest --gen) exercised the full MCP handshake, tool discovery, and SD generation through the MCP path.

How to add a new consumer

ImageForge is designed so that adding a new consumer takes three steps:

1. Decide which surface to use: HTTP for any process or language, MCP for AI agents.
2. Point at the service: for HTTP, import httpx and call POST /generate; for MCP, add the .mcp.json registration block above.
3. Make no engine changes: the shared Engine, cache, and prompt-assist singletons are already running. Your consumer pays zero additional RAM cost for the model weights because they are already loaded.
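One consequence of the shared cache is that two consumers sending the same seeded request can reuse a single result. As a rough illustration of how a content-addressed key could be derived, the sketch below hashes a canonical JSON form of the request parameters. The choice of SHA-256, the exact fields included, and the name `cache_key` are assumptions, not a description of the project's cache module.

```python
import hashlib
import json
from typing import Any, Dict, Optional


def cache_key(params: Dict[str, Any]) -> Optional[str]:
    """Return a hex digest for seeded requests, or None when there is no seed."""
    if params.get("seed") is None:
        return None  # unseeded requests are random, so nothing stable can be reused
    # Sorting keys and fixing separators makes the key independent of dict order.
    canonical = json.dumps(params, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Identical parameters always map to the same key, and changing any single field such as the seed or the model produces a different one.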

Example: a new Python agent using the HTTP surface

# In any Python process on the same machine
import base64

import httpx


async def make_image(description: str) -> bytes:
    """Request one image and return the decoded PNG bytes."""
    async with httpx.AsyncClient(timeout=120) as client:
        r = await client.post(
            "http://127.0.0.1:8765/generate",
            json={"prompt": description, "return_mode": "base64"},
        )
        r.raise_for_status()
        return base64.b64decode(r.json()["image_base64"])

Example: a new agent using the MCP surface

Add to the agent's MCP config:

"imageforge": {
  "command": "arch",
  "args": ["-arm64", "$HOME/.floor-voice-studio/venv-sd/bin/python",
           "-m", "imageforge.mcp_server"],
  "cwd": "~/Developer/imageforge"
}

Then call generate_image / edit_image / inpaint_image as MCP tools.

Probe scripts

Two standalone measurement scripts live at the project root (not under research/):

Script	Purpose	Run command
`probe_klein_mps.py`	Measure FLUX.2-klein-4B peak RSS + latency on MPS (gates D8 migration decision)	`PROBE_DOWNLOAD=1 arch -arm64 $HOME/.floor-voice-studio/venv-sd/bin/python probe_klein_mps.py`
`probe_dit_throughput.py`	Measure from-scratch DiT steps/sec at 256/512 px (gates Phase-3 viability)	`arch -arm64 $HOME/.floor-voice-studio/venv-sd/bin/python probe_dit_throughput.py --resolution 256 --steps 100`

The two scripts target different hardware: the DiT throughput probe requires a CUDA machine (4090), and the klein MPS-fit probe requires a 16 GB Mac. See research/OPERATION.md for full gate criteria.

Project layout

imageforge/
├── imageforge/
│   ├── settings.py            # Global constants (PROJECT_ROOT, OLLAMA_URL, etc.)
│   ├── config.py              # Pydantic-settings singleton (get_settings())
│   ├── api/
│   │   ├── app.py             # FastAPI application (all routes + lifespan)
│   │   ├── server.py          # Compatibility shim (re-exports app)
│   │   ├── prompt_bridge.py   # Async bridge to prompt-assist for HTTP handlers
│   │   └── schemas.py         # Shared Pydantic schemas
│   ├── engine/
│   │   ├── pipeline.py        # Core SD engine (load-once, MPS-locked inference)
│   │   ├── models.py          # Model registry (specs + resolve_model)
│   │   └── cache.py           # Content-addressed cache + outputs persistence
│   ├── mcp/
│   │   └── server.py          # MCP server (re-export of mcp_server.py)
│   ├── mcp_server.py          # MCP stdio server (canonical entrypoint)
│   ├── prompt/                # Prompt-assist module namespace
│   └── services/
│       └── prompt_assist.py   # Ollama/gemma4 async client (expand + tag)
├── models/                    # Downloaded HF model weights (git-ignored)
├── outputs/                   # Generated images (git-ignored)
├── cache/                     # Content-addressed PNG + JSON cache (git-ignored)
├── tests/                     # pytest test suite
├── probe_klein_mps.py         # MPS-fit probe for FLUX.2-klein (run at repo root)
├── probe_dit_throughput.py    # DiT throughput probe (run at repo root)
├── eval_harness.py            # Evaluation harness (run at repo root)
├── bench_flux_lora_latency.py # FLUX LoRA latency benchmark (run at repo root)
├── recaption.py               # Image recaptioning utility (run at repo root)
├── run_api.sh                 # Launch HTTP API
├── run_mcp.sh                 # Launch MCP server
├── .env.example               # Config template
└── pyproject.toml

Troubleshooting

MPS OOM / generation crashes

Symptom: Python crashes or raises RuntimeError: MPS backend out of memory during generation.
Fix:
1. Reduce image size: "width": 256, "height": 256.
2. Switch to a lighter model: IMAGEFORGE_MODEL=sd-turbo.
3. Close memory-intensive apps (browsers, Xcode).
4. Do NOT enable flux-q4 on 16 GB — peak memory is 16–18 GB.
5. Restart the service after an OOM to free GPU memory.

Model not downloading / HuggingFace errors

Symptom: OSError: ... is not a local folder and is not a valid model identifier.
Fix: Models are downloaded from HuggingFace on first use. Ensure internet access for the first run; subsequent runs are fully offline. Downloaded weights land in ~/.cache/huggingface/ by default. If disk space is a concern, set HF_HOME to move the cache to another location.

Verify the venv has huggingface_hub installed:

arch -arm64 $HOME/.floor-voice-studio/venv-sd/bin/python -c "import huggingface_hub; print('ok')"

Ollama unreachable / prompt-assist disabled

Symptom: assist_prompt returns "assisted": false or /health shows "ollama": {"reachable": false}.
This is not fatal. Prompt-assist gracefully degrades — generation continues with the original prompt unchanged.
Fix: Start Ollama (ollama serve) and pull the model:
```
ollama pull gemma4:latest
```
Verify:
```
curl http://127.0.0.1:11434/api/tags
```

FLUX guarded — cannot load flux-q4

Symptom: RuntimeError: Model 'flux-q4' is guarded: ....
Fix: Set ALLOW_FLUX=1 in .env or environment. Read the guard reason first — 16 GB machines will swap.

Port already in use

lsof -i :8765
kill <PID>
./run_api.sh

Or change PORT in .env.
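The same check can be made from Python before launching, for instance in a wrapper script. The sketch below simply attempts a TCP connection; `port_in_use` is an illustrative helper, and a successful connection only shows that something is listening, not that it is ImageForge.

```python
import socket


def port_in_use(host: str = "127.0.0.1", port: int = 8765, timeout: float = 0.5) -> bool:
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success and an error code otherwise.
        return sock.connect_ex((host, port)) == 0
```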

MCP server produces no output / hangs

The MCP server writes logs to stderr and the protocol to stdout. If you run it interactively you will see no stdout output — that is correct. Use --selftest to verify it works without a real MCP client.
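The reason for this split is that any stray text on stdout would corrupt the protocol stream. A minimal sketch of the logging side, using only the standard library, is shown below; `configure_stdio_logging` and the log format are illustrative choices, not the server's actual setup.

```python
import logging
import sys


def configure_stdio_logging(name: str = "imageforge") -> logging.Logger:
    """Send log records to stderr so stdout stays reserved for protocol messages."""
    logger = logging.getLogger(name)
    if not logger.handlers:  # avoid stacking duplicate handlers on repeat calls
        handler = logging.StreamHandler(sys.stderr)
        handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
        logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep records away from root handlers that might use stdout
    return logger
```

With this in place, `print()` should be avoided in server code paths, since it writes to stdout by default.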
