Standalone, reusable local AI image-generation service.
Runs fully offline on Apple Silicon (MPS) using Stable Diffusion via diffusers, once the model weights have been downloaded on first use.
No paid APIs, no cloud calls — everything runs on your machine.
Target hardware: M2 Pro, 16 GB unified memory (tested).
| Surface | Transport | Use case |
|---|---|---|
| HTTP API (FastAPI) | TCP on port 8765 | Any app, curl, browser, Python SDK |
| MCP server | stdio | AI agents — Claude Code, scripts, orchestrators |
Both surfaces share the same process-level singletons: the SD engine (load-once weights), the prompt-assist client (gemma4/Ollama), and the content-addressed cache.
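The load-once behaviour of these singletons can be pictured with a small sketch. It is an illustration rather than ImageForge's actual code: `Engine` and `get_engine()` are hypothetical names, and `functools.lru_cache` stands in for whatever mechanism the project really uses.

```python
from functools import lru_cache


class Engine:
    """Hypothetical stand-in for the SD engine; real weight loading is omitted."""

    load_count = 0  # counts how many times the (expensive) constructor ran

    def __init__(self) -> None:
        Engine.load_count += 1


@lru_cache(maxsize=1)
def get_engine() -> Engine:
    """Build the engine on the first call, then return the same instance."""
    return Engine()


first = get_engine()
second = get_engine()
```

Every caller inside the process receives the same object, so the constructor (and, in a real engine, the weight load) runs only once.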
cd ~/Developer/imageforge
./start.sh # bootstrap deps into the venv, seed .env, launch HTTP API

start.sh verifies the SD venv, installs the light HTTP/MCP layer if anything is
missing, seeds .env from .env.example on first run, reports whether the heavy
SD stack is present, then launches. Other modes:

./start.sh api --reload # HTTP API with hot-reload
./start.sh mcp # MCP stdio server
./start.sh selftest # MCP self-test (fast, no GPU)
./start.sh check # bootstrap + dependency report, then exit

The manual steps below are the detailed equivalent if you prefer to run each part
yourself (or are not on a Mac; see also the CPU-only Dockerfile).
All SD components run under the existing venv at ~/.floor-voice-studio/venv-sd.
Install the ImageForge HTTP/MCP layer into it:
cd ~/Developer/imageforge
arch -arm64 $HOME/.floor-voice-studio/venv-sd/bin/pip install \
  fastapi "uvicorn[standard]" mcp pydantic pydantic-settings httpx

Pre-installed in the venv: diffusers, torch, transformers, Pillow, accelerate.

cp .env.example .env
# edit .env as needed; defaults work for M2 Pro 16 GB

./run_api.sh
# with hot-reload during development:
./run_api.sh --reload

./run_mcp.sh
# or directly:
arch -arm64 $HOME/.floor-voice-studio/venv-sd/bin/python -m imageforge.mcp_server

On a Mac, prefer ./start.sh (native MPS). Docker on macOS has no MPS
passthrough, so the container runs the engine CPU-only (slow: minutes per
image). It exists for portability, Linux servers, and CI smoke tests:
docker build -t imageforge .
docker run --rm -p 8765:8765 \
-v "$HOME/.cache/huggingface:/root/.cache/huggingface" \
-v "$(pwd)/outputs:/app/outputs" imageforge
curl http://127.0.0.1:8765/health

| Variable | Default | Description |
|---|---|---|
| IMAGEFORGE_MODEL | sd-turbo | Active model key (see model table below) |
| DEVICE | mps | mps / cuda / cpu |
| PORT | 8765 | HTTP API port |
| HOST | 127.0.0.1 | HTTP API bind address |
| OLLAMA_URL | http://127.0.0.1:11434 | Ollama base URL for prompt-assist |
| OLLAMA_MODEL | gemma4:latest | Model used for prompt expansion |
| OUTPUT_DIR | outputs/ | Where generated images land on disk |
| CACHE_DIR | cache/ | Content-addressed cache directory |
| MAX_RES | 1024 | Maximum allowed image dimension |
| ALLOW_FLUX | 0 | Set 1 to unlock the guarded flux-q4 model |
| WARMUP | 0 | Set 1 to preload SD weights at API startup |
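As a rough picture of how these variables and defaults could be read, here is a minimal sketch. The project itself uses pydantic-settings (config.py); this version uses only the standard library, and the `Settings` dataclass and `load_settings()` helper are hypothetical names. Only the variable names and defaults are taken from the table above.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    model: str
    device: str
    host: str
    port: int
    max_res: int
    allow_flux: bool
    warmup: bool


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read the documented variables, falling back to the documented defaults."""

    def flag(name: str) -> bool:
        # Assumption: only the literal string "1" switches a flag on.
        return env.get(name, "0") == "1"

    return Settings(
        model=env.get("IMAGEFORGE_MODEL", "sd-turbo"),
        device=env.get("DEVICE", "mps"),
        host=env.get("HOST", "127.0.0.1"),
        port=int(env.get("PORT", "8765")),
        max_res=int(env.get("MAX_RES", "1024")),
        allow_flux=flag("ALLOW_FLUX"),
        warmup=flag("WARMUP"),
    )


defaults = load_settings({})
custom = load_settings({"PORT": "9000", "ALLOW_FLUX": "1"})
```

Passing the environment as a mapping keeps the function easy to exercise without touching the real process environment.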
| Model key | Steps | Default res | Est. RAM | Est. time (M2 Pro) | Operations | Guard | Quality |
|---|---|---|---|---|---|---|---|
| sd-turbo | 1–2 | 512 px | ~4 GB | ~2–3 s | text2img, img2img, inpaint | none | Draft |
| sdxl-turbo | 4 | 512 px (up to 1024) | ~7 GB | ~8 s | text2img, img2img, inpaint | none | Good |
| sdxl | 30 | 1024 px | ~7 GB | ~40 s | text2img, img2img, inpaint | none | High |
| sdxl-lcm | 6 | 1024 px | ~9 GB | ~20 s | text2img, img2img, inpaint | none | Best (non-FLUX) |
| sdxl-portrait | 30 | 1024 px | ~8 GB | ~40 s | text2img, img2img, inpaint | none | High + portrait LoRA |
| flux-q4 | 4 | 1024 px | ~8 GB active / 16–18 GB peak | ~45 s | text2img only | ALLOW_FLUX=1 | Highest |
| flux2-klein | 4 | 1024 px | ~10 GB (int8) / 13–16 GB (bf16) | ~35–55 s | text2img only | ALLOW_KLEIN=1 | Frontier |
Why sd-turbo is the default: 1-step inference at ~2 s per image; safe on any machine with 8 GB+. Use sdxl-lcm when quality matters and time allows.
sdxl vs sdxl-portrait: sdxl is plain SDXL base-1.0 (30 steps, 1024 px, no LoRA). sdxl-portrait is the same checkpoint with a fused portrait LoRA — it requires IMAGEFORGE_LORA_PORTRAIT=/path/to/lora.safetensors to be set; without it, it behaves like plain sdxl.
FLUX note: flux-q4 peaks at 16–18 GB due to GGUF loader overhead — this causes macOS to swap on a 16 GB machine. Enable only with ALLOW_FLUX=1 and accept the trade-off, or use a machine with 24 GB+.
FLUX.2-klein note: flux2-klein uses a new architecture (Qwen3-4B encoder, Flux2KleinPipeline) distinct from FLUX.1. With Quanto int8 quantisation the transformer fits in ~9.4–10.4 GB; plain bf16 peaks at 13–16 GB (RED on 16 GB machines). The Phase-2 loader (kind="flux2-klein") has shipped: klein_loader.load_flux2_klein is wired via _load_flux2_klein in pipeline.py behind ALLOW_KLEIN=1. The loader is no longer the blocker; loading now fails only when the weights cannot be downloaded (no network or missing HF checkpoint) or when diffusers<0.38 is installed (no Flux2KleinPipeline).
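The opt-in guards on the two FLUX models can be sketched as a simple lookup. This is an assumption about the shape of the check, not the project's code: `GUARDS` and `check_guard()` are hypothetical, and the error text after "is guarded:" is illustrative wording only.

```python
from typing import Mapping

# Guard flags as listed in the model table; the dict itself is hypothetical.
GUARDS = {"flux-q4": "ALLOW_FLUX", "flux2-klein": "ALLOW_KLEIN"}


def check_guard(model_key: str, env: Mapping[str, str]) -> None:
    """Raise if a guarded model is requested without its opt-in flag."""
    flag = GUARDS.get(model_key)
    if flag is not None and env.get(flag) != "1":
        raise RuntimeError(
            f"Model '{model_key}' is guarded: set {flag}=1 to enable it"
        )


check_guard("sd-turbo", {})                  # unguarded model: passes
check_guard("flux-q4", {"ALLOW_FLUX": "1"})  # opted in: passes
try:
    check_guard("flux-q4", {})
    blocked = False
except RuntimeError:
    blocked = True
```

Checking the flag before any weights are touched means a refused request costs no memory.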
Probe MPS availability and the Ollama endpoint.

curl http://127.0.0.1:8765/health

{
  "status": "ok",
  "device": "mps",
  "mps_available": true,
  "engine_constructed": true,
  "resident_models": ["sd-turbo"],
  "default_model": "sd-turbo",
  "ollama": {
    "reachable": true,
    "url": "http://127.0.0.1:11434",
    "model": "gemma4:latest",
    "model_available": true
  }
}

List all registered models and their capabilities.

curl http://127.0.0.1:8765/models

Text-to-image. Returns a base64 PNG plus generation metadata.
curl -X POST http://127.0.0.1:8765/generate \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "a red fox sitting in autumn leaves, golden hour, cinematic",
    "seed": 42,
    "width": 512,
    "height": 512,
    "return_mode": "both"
  }'

With prompt-assist (gemma4 expands the short request):

curl -X POST http://127.0.0.1:8765/generate \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "a red fox",
    "assist_prompt": true,
    "model": "sdxl-turbo",
    "seed": 7
  }'

Key request fields:
| Field | Type | Default | Description |
|---|---|---|---|
| prompt | string | required | Image description |
| negative_prompt | string | null | What to avoid |
| model | string | sd-turbo | Model key |
| steps | int | model default | Inference steps |
| guidance | float | model default | CFG scale (turbo models use 0) |
| width / height | int | model default | Must be multiples of 8 |
| seed | int | random | Reproducibility seed |
| assist_prompt | bool | false | Expand via gemma4 first |
| use_cache | bool | true | Return cached result for identical seeded params |
| return_mode | string | "both" | "base64" / "path" / "both" |
Response shape:

{
  "ok": true,
  "task": "text2img",
  "model": "sd-turbo",
  "seed": 42,
  "steps": 2,
  "guidance": 0.0,
  "width": 512,
  "height": 512,
  "elapsed_s": 2.871,
  "prompt": "a red fox sitting in autumn leaves, golden hour, cinematic",
  "assisted": false,
  "cached": false,
  "output_path": "$(pwd)/outputs/20260610_075828_521b48.png",
  "image_base64": "<base64 PNG data>",
  "mime": "image/png"
}

Image-to-image. Alter an existing image guided by a prompt.
Supply the source image as base64 (init_image) or a local path (init_image_path).
# Path-based (simpler for local consumers)
# $(pwd) sits outside the single quotes so the shell actually expands it.
curl -X POST http://127.0.0.1:8765/edit \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "same fox but in a snowy winter forest",
    "init_image_path": "'"$(pwd)"'/outputs/selftest.png",
    "strength": 0.65,
    "seed": 100
  }'

strength: 0.0 = keep the original, 1.0 = full regeneration. Default 0.6.
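To make the strength scale concrete: in the diffusers img2img pipelines, strength determines how many of the scheduled denoising steps actually run, roughly int(steps * strength). The helper below mirrors that arithmetic as a sketch. `effective_steps()` is a hypothetical name, and this document does not say whether ImageForge adjusts the step count on top of what diffusers does.

```python
def effective_steps(steps: int, strength: float) -> int:
    """Number of denoising steps that run for a given strength."""
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be between 0.0 and 1.0")
    return min(int(steps * strength), steps)


half_strength = effective_steps(30, 0.5)   # half of a 30-step schedule
turbo_edit = effective_steps(2, 0.65)      # 2-step turbo model, as in the curl call
keep_original = effective_steps(30, 0.0)   # nothing is re-denoised
full_regen = effective_steps(4, 1.0)       # the whole schedule runs
```

Under this arithmetic a 2-step turbo model at strength 0.65 runs a single step, so low step counts leave little room for fine-grained strength control.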
Regenerate only the white regions of a mask within the source image.

curl -X POST http://127.0.0.1:8765/inpaint \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "a bright orange pumpkin lantern",
    "init_image_path": "/path/to/source.png",
    "mask_image_path": "/path/to/mask.png",
    "strength": 0.85
  }'

Mask convention: white pixels = repaint, black pixels = keep.
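A mask following this convention can be produced in a few lines. The sketch below writes an 8-bit grayscale PNG using only the standard library so that it stays self-contained; in practice Pillow (already in the venv) would be the more convenient tool. `rect_mask_png()` is a hypothetical helper, not part of ImageForge.

```python
import struct
import zlib


def _chunk(kind: bytes, data: bytes) -> bytes:
    """Frame one PNG chunk: length, type, data, CRC-32 over type + data."""
    crc = zlib.crc32(kind + data) & 0xFFFFFFFF
    return struct.pack(">I", len(data)) + kind + data + struct.pack(">I", crc)


def rect_mask_png(width: int, height: int, box: tuple[int, int, int, int]) -> bytes:
    """8-bit grayscale PNG: white (255) inside box, black (0) elsewhere.

    box is (left, top, right, bottom), with right and bottom exclusive.
    """
    left, top, right, bottom = box
    rows = bytearray()
    for y in range(height):
        rows.append(0)  # PNG filter type 0 (None) for this scanline
        for x in range(width):
            inside = left <= x < right and top <= y < bottom
            rows.append(255 if inside else 0)
    # IHDR: width, height, bit depth 8, colour type 0 (grayscale), defaults.
    ihdr = struct.pack(">IIBBBBB", width, height, 8, 0, 0, 0, 0)
    return (
        b"\x89PNG\r\n\x1a\n"
        + _chunk(b"IHDR", ihdr)
        + _chunk(b"IDAT", zlib.compress(bytes(rows)))
        + _chunk(b"IEND", b"")
    )


# Repaint the central square of a 64x64 image, keep the border.
mask = rect_mask_png(64, 64, (16, 16, 48, 48))
```

Writing `mask` to a file and passing that file's path as mask_image_path would ask the service to repaint only the central square. The mask should match the source image's dimensions.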
Interactive OpenAPI docs (Swagger UI) — open http://127.0.0.1:8765/docs in a browser.
import asyncio
import base64
import pathlib

import httpx

IMAGEFORGE = "http://127.0.0.1:8765"


async def generate_avatar(prompt: str, seed: int = 42) -> pathlib.Path:
    """Generate an avatar image and return the local path."""
    async with httpx.AsyncClient(timeout=120) as client:
        r = await client.post(
            f"{IMAGEFORGE}/generate",
            json={
                "prompt": prompt,
                "model": "sd-turbo",
                "seed": seed,
                "width": 512,
                "height": 512,
                "return_mode": "both",
                "assist_prompt": False,
            },
        )
        r.raise_for_status()
        data = r.json()
    # The output file is already on disk at data["output_path"].
    # Optionally decode the inline base64 if you need the bytes in-process:
    png_bytes = base64.b64decode(data["image_base64"])  # unused here
    out = pathlib.Path(data["output_path"])
    print(f"avatar generated: {out} ({data['elapsed_s']:.1f}s, seed={data['seed']})")
    return out


async def edit_avatar(source_path: str, prompt: str) -> pathlib.Path:
    """Alter an existing avatar."""
    async with httpx.AsyncClient(timeout=120) as client:
        r = await client.post(
            f"{IMAGEFORGE}/edit",
            json={
                "prompt": prompt,
                "init_image_path": source_path,
                "strength": 0.65,
                "return_mode": "both",
            },
        )
        r.raise_for_status()
        data = r.json()
    return pathlib.Path(data["output_path"])


# Synchronous wrapper for non-async callers:
def generate_avatar_sync(prompt: str, seed: int = 42) -> pathlib.Path:
    return asyncio.run(generate_avatar(prompt, seed))

The MCP server exposes ImageForge as Model-Context-Protocol tools over stdio. Logs go to stderr; stdout is the protocol channel. It matches the HTTP surface in capability and shares the same engine, cache, and prompt-assist singletons.
Add to your project's .mcp.json (or to ~/.claude/settings.json under "mcpServers"):
{
  "mcpServers": {
    "imageforge": {
      "command": "arch",
      "args": [
        "-arm64",
        "$HOME/.floor-voice-studio/venv-sd/bin/python",
        "-m",
        "imageforge.mcp_server"
      ],
      "cwd": "~/Developer/imageforge"
    }
  }
}

Alternative (via run_mcp.sh):

{
  "mcpServers": {
    "imageforge": {
      "command": "~/Developer/imageforge/run_mcp.sh",
      "args": [],
      "cwd": "~/Developer/imageforge"
    }
  }
}

| Tool | Required args | Optional args | Returns |
|---|---|---|---|
| generate_image | prompt | model, steps, guidance, width, height, seed, negative_prompt, assist | ImageContent (base64 PNG) + TextContent (JSON metadata) |
| edit_image | prompt, image_path | strength, model, steps, guidance, seed, negative_prompt, assist | Edited PNG + JSON metadata |
| inpaint_image | prompt, image_path, mask_path | strength, model, steps, guidance, seed, negative_prompt, assist | Result PNG + JSON metadata |
| assist_prompt | prompt | style | TextContent: expanded prompt + tags |
| list_models | none | none | TextContent: JSON model registry |
Image paths for edit_image / inpaint_image: pass absolute paths or project-relative paths (e.g. outputs/foo.png). The server resolves relative paths against the project root.
assist=true on generation tools calls gemma4/Ollama to expand a short request into a richer SD prompt. Transparent fallback to original prompt if Ollama is unavailable.
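The fallback behaviour can be sketched as a thin wrapper around whatever function performs the expansion. `assist_or_passthrough()` and its `expand` callable are hypothetical names. The passthrough fields mirror the self-test output shown later in this document, while the "ollama" source label is an assumption.

```python
from typing import Callable


def assist_or_passthrough(prompt: str, expand: Callable[[str], str]) -> dict:
    """Try to expand the prompt; on any failure return it unchanged."""
    try:
        return {"prompt": expand(prompt), "assisted": True, "source": "ollama"}
    except Exception:
        # Ollama down, timeout, bad response: degrade instead of failing.
        return {"prompt": prompt, "assisted": False, "source": "passthrough"}


def unreachable(_: str) -> str:
    """Simulates an Ollama endpoint that cannot be reached."""
    raise ConnectionError("Ollama is not running")


fallback = assist_or_passthrough("a red fox", unreachable)
expanded = assist_or_passthrough("a red fox", lambda p: p + ", golden hour")
```

Because the failure is absorbed at this boundary, the generation call downstream never needs to know whether Ollama was reachable.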
When imageforge is registered, Claude Code can call tools like:
<use_mcp_tool>
<server_name>imageforge</server_name>
<tool_name>generate_image</tool_name>
<arguments>
{
"prompt": "friendly robot avatar, flat design, vibrant colors",
"model": "sd-turbo",
"seed": 99,
"width": 512,
"height": 512,
"assist": true
}
</arguments>
</use_mcp_tool>
The response includes the image inline as ImageContent (rendered in-chat) plus a
TextContent block with the metadata JSON:
{
  "task": "text2img",
  "model": "sd-turbo",
  "seed": 99,
  "steps": 2,
  "guidance": 0.0,
  "width": 512,
  "height": 512,
  "elapsed_s": 2.91,
  "prompt": "friendly robot avatar, ...",
  "output_path": "$(pwd)/outputs/20260610_XXXXXX_XXXXXX.png"
}

# Handshake + tool discovery + list_models + assist (no GPU):
arch -arm64 $HOME/.floor-voice-studio/venv-sd/bin/python \
  -m imageforge.mcp_server --selftest
# Also run a real generation through the MCP path (loads SD, ~15 s):
arch -arm64 $HOME/.floor-voice-studio/venv-sd/bin/python \
  -m imageforge.mcp_server --selftest --gen

Expected output:
[selftest] initialize ok: imageforge v0.1.0
[selftest] list_tools -> ['generate_image', 'edit_image', 'inpaint_image', 'assist_prompt', 'list_models']
[selftest] list_models ok (sd-turbo present)
[selftest] assist_prompt ok: {"prompt": ..., "assisted": false, "source": "passthrough"}
[selftest] generate_image (this loads SD; may take a while)...
[selftest] generate_image -> content kinds: ['image', 'text']
[selftest] generate_image ok: 24871 bytes PNG, mime=image/png
[selftest] ALL PASS
# MCP layer tests:
arch -arm64 $HOME/.floor-voice-studio/venv-sd/bin/python -m unittest tests.test_mcp
# Include the real SD generation test (loads diffusers, ~15 s):
IMAGEFORGE_TEST_GEN=1 arch -arm64 $HOME/.floor-voice-studio/venv-sd/bin/python \
  -m unittest tests.test_mcp
# All tests:
arch -arm64 $HOME/.floor-voice-studio/venv-sd/bin/python -m pytest tests/

A real smoke test was run on 2026-06-10 (M2 Pro, 16 GB, macOS).
Smoke test output (outputs/20260610_075828_521b48.json):

{
  "saved_at_iso": "2026-06-10T07:58:28Z",
  "filename": "20260610_075828_521b48.png",
  "seed": 42,
  "model": "sd-turbo",
  "task": "text2img",
  "steps": 2,
  "guidance": 0.0,
  "width": 256,
  "height": 256,
  "elapsed_s": 15.432,
  "prompt": "a red fox in autumn leaves"
}

Three generated images, one edited image, and one inpainted image are present in
outputs/ as proof. The MCP self-test (--selftest --gen) exercised the full
MCP handshake, tool discovery, and SD generation through the MCP path.
ImageForge is designed so adding a new consumer is three steps:
- Decide which surface to use: HTTP for any process/language, MCP for AI agents.
- Point at the service: For HTTP, import httpx and call POST /generate. For MCP, add the .mcp.json registration block above.
- No engine changes needed: The shared Engine, cache, and prompt-assist singletons are already running. Your consumer pays zero additional RAM cost for the model weights because they are already loaded.
Example: a new Python agent using the HTTP surface

# In any Python process on the same machine
import base64

import httpx


async def make_image(description: str) -> bytes:
    async with httpx.AsyncClient(timeout=120) as client:
        r = await client.post(
            "http://127.0.0.1:8765/generate",
            json={"prompt": description, "return_mode": "base64"},
        )
        r.raise_for_status()
        return base64.b64decode(r.json()["image_base64"])

Example: a new agent using the MCP surface
Add to the agent's MCP config:

"imageforge": {
  "command": "arch",
  "args": ["-arm64", "$HOME/.floor-voice-studio/venv-sd/bin/python",
           "-m", "imageforge.mcp_server"],
  "cwd": "~/Developer/imageforge"
}

Then call generate_image / edit_image / inpaint_image as MCP tools.
Two standalone measurement scripts live at the project root (not under research/):
| Script | Purpose | Run command |
|---|---|---|
| probe_klein_mps.py | Measure FLUX.2-klein-4B peak RSS + latency on MPS (gates D8 migration decision) | PROBE_DOWNLOAD=1 arch -arm64 $HOME/.floor-voice-studio/venv-sd/bin/python probe_klein_mps.py |
| probe_dit_throughput.py | Measure from-scratch DiT steps/sec at 256/512 px (gates Phase-3 viability) | arch -arm64 $HOME/.floor-voice-studio/venv-sd/bin/python probe_dit_throughput.py --resolution 256 --steps 100 |

The DiT throughput probe requires a CUDA machine (4090); the klein MPS-fit probe requires a 16 GB Mac. See research/OPERATION.md for full gate criteria.
imageforge/
├── imageforge/
│ ├── settings.py # Global constants (PROJECT_ROOT, OLLAMA_URL, etc.)
│ ├── config.py # Pydantic-settings singleton (get_settings())
│ ├── api/
│ │ ├── app.py # FastAPI application (all routes + lifespan)
│ │ ├── server.py # Compatibility shim (re-exports app)
│ │ ├── prompt_bridge.py # Async bridge to prompt-assist for HTTP handlers
│ │ └── schemas.py # Shared Pydantic schemas
│ ├── engine/
│ │ ├── pipeline.py # Core SD engine (load-once, MPS-locked inference)
│ │ ├── models.py # Model registry (specs + resolve_model)
│ │ └── cache.py # Content-addressed cache + outputs persistence
│ ├── mcp/
│ │ └── server.py # MCP server (re-export of mcp_server.py)
│ ├── mcp_server.py # MCP stdio server (canonical entrypoint)
│ ├── prompt/ # Prompt-assist module namespace
│ └── services/
│ └── prompt_assist.py # Ollama/gemma4 async client (expand + tag)
├── models/ # Downloaded HF model weights (git-ignored)
├── outputs/ # Generated images (git-ignored)
├── cache/ # Content-addressed PNG + JSON cache (git-ignored)
├── tests/ # pytest test suite
├── probe_klein_mps.py # MPS-fit probe for FLUX.2-klein (run at repo root)
├── probe_dit_throughput.py # DiT throughput probe (run at repo root)
├── eval_harness.py # Evaluation harness (run at repo root)
├── bench_flux_lora_latency.py # FLUX LoRA latency benchmark (run at repo root)
├── recaption.py # Image recaptioning utility (run at repo root)
├── run_api.sh # Launch HTTP API
├── run_mcp.sh # Launch MCP server
├── .env.example # Config template
└── pyproject.toml
- Symptom: Python crashes or produces a RuntimeError: MPS out of memory during generation.
- Fix:
  - Reduce image size: "width": 256, "height": 256.
  - Switch to a lighter model: IMAGEFORGE_MODEL=sd-turbo.
  - Close memory-intensive apps (browsers, Xcode).
  - Do NOT enable flux-q4 on 16 GB: peak memory is 16–18 GB.
  - Restart the service after an OOM to free GPU memory.

- Symptom: OSError: ... is not a local folder and is not a valid model identifier.
- Fix: Models are downloaded from HuggingFace on first use. Ensure internet access for the first run; subsequent runs are fully offline. Downloaded weights land in ~/.cache/huggingface/. Check HF_HOME if disk space is a concern.
- Verify the venv has huggingface_hub installed: arch -arm64 $HOME/.floor-voice-studio/venv-sd/bin/python -c "import huggingface_hub; print('ok')"

- Symptom: assist_prompt returns "assisted": false or /health shows "ollama": {"reachable": false}.
- This is not fatal. Prompt-assist degrades gracefully; generation continues with the original prompt unchanged.
- Fix: Start Ollama (ollama serve) and pull the model: ollama pull gemma4:latest
- Verify: curl http://127.0.0.1:11434/api/tags

- Symptom: RuntimeError: Model 'flux-q4' is guarded: ....
- Fix: Set ALLOW_FLUX=1 in .env or environment. Read the guard reason first; 16 GB machines will swap.

lsof -i :8765
kill <PID>
./run_api.sh

Or change PORT in .env.
The MCP server writes logs to stderr and the protocol to stdout. If you run it
interactively you will see no stdout output — that is correct. Use --selftest to
verify it works without a real MCP client.
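The stderr/stdout split can be demonstrated with a short sketch. It assumes nothing about ImageForge's own logging setup: `make_stderr_logger()` and the logger name are hypothetical, and the printed JSON line merely stands in for a protocol frame.

```python
import contextlib
import io
import logging
import sys


def make_stderr_logger(name: str = "imageforge.mcp") -> logging.Logger:
    """Logger whose only handler writes to stderr, leaving stdout untouched."""
    logger = logging.getLogger(name)
    logger.handlers.clear()
    logger.addHandler(logging.StreamHandler(sys.stderr))
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep root handlers from echoing elsewhere
    return logger


# Capture both streams to show which channel each message lands on.
out, err = io.StringIO(), io.StringIO()
with contextlib.redirect_stdout(out), contextlib.redirect_stderr(err):
    log = make_stderr_logger()
    log.info("model loaded")
    print('{"jsonrpc": "2.0"}')  # stands in for a protocol frame on stdout
```

Keeping diagnostics off stdout matters because an MCP client parses everything on that channel as protocol traffic, so a stray log line there would corrupt the stream.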