So you want to run your own LLaMA models locally and expose them through a clean, OpenAI-compatible API? That's exactly what LlmVault does — and it stays out of your way while doing it.
It's minimal, self-hosted, and designed to work on Windows, macOS, and Linux — natively, with an optional FRP tunnel if you want to reach it from anywhere.
- `POST /v1/chat/completions` — multi-turn chat with streaming
- `POST /v1/completions` — classic text completion, also streaming
- `POST /v1/embeddings` — vector embeddings for your semantic search needs
- `GET /v1/models` — see what models are loaded
- `GET /v1/usage?days=7` — daily token usage, broken down by API key
- `GET /v1/audit?n=50` — a full audit log of who asked what
- `GET /health` — a quick pulse check (no auth needed)
Under the hood, API keys are stored in an encrypted database using XChaCha20-Poly1305 with Argon2id key derivation — so your secrets actually stay secret. You also get per-key rate limiting, GPU auto-detection across CUDA / Metal / ROCm / SYCL / Vulkan / CPU, and an FRP tunnel for public access without ever poking a hole in your router.
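To give a feel for what GPU auto-detection means in practice, here is a rough, stdlib-only illustration of backend probing. This is NOT LlmVault's actual detection code — just the general idea of checking for vendor tooling and platform hints (SYCL probing omitted for brevity):

```python
import platform
import shutil

def guess_backend() -> str:
    """Rough illustration of backend probing -- NOT LlmVault's actual logic."""
    if shutil.which("nvidia-smi"):      # NVIDIA driver tooling present
        return "cuda"
    if shutil.which("rocm-smi"):        # AMD ROCm tooling present
        return "rocm"
    if platform.system() == "Darwin" and platform.machine() == "arm64":
        return "metal"                  # Apple Silicon
    if shutil.which("vulkaninfo"):      # Vulkan loader tooling present
        return "vulkan"
    return "cpu"                        # safe fallback

print(guess_backend())
```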
```bash
python start.py --init
```

This runs an interactive setup that generates your `.env` with all the secrets it needs, and walks you through downloading a model. You only need to do this once.
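For reference, the generated `.env` ends up looking roughly like this — the variable names match the configuration table later in this README, and every value here is a placeholder (yours are generated for you):

```
SNKV_PASSWORD=<generated>
FRP_TOKEN=<generated>
FRP_SERVER_ADDR=<vm-public-ip>
DOMAIN=<ip>.nip.io
ACME_EMAIL=you@example.com
RATE_LIMIT=0
RATE_WINDOW=60
```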
You'll need an Ubuntu 24.04 VM somewhere — DigitalOcean, Hetzner, AWS, Azure, GCP, whatever you prefer. Open these ports in your firewall or security group:
| Port | Protocol | What it's for |
|---|---|---|
| 22 | TCP | SSH access |
| 80 | TCP | Let's Encrypt ACME challenge |
| 443 | TCP + UDP | HTTPS (UDP for HTTP/3) |
| 7000 | TCP | FRP tunnel |
| 8000 | TCP | API (TCP proxy) |
Then deploy everything with a single script:
Linux / macOS:
```bash
chmod 400 linux_key.pem
./scripts/deploy_vps.sh linux_key.pem <vm-public-ip>
```

Windows (PowerShell):

```powershell
.\scripts\deploy_vps.ps1 -Pem linux_key.pem -IP <vm-public-ip>
```

It'll ask you two questions — your domain (the default `<ip>.nip.io` works great and needs zero registration) and an email for Let's Encrypt. Everything else happens automatically: Docker gets installed on the VM, frps and caddy start up, and your `.env` gets updated with the server address.
```bash
sudo docker compose up -d
```

Give it a second, then check the logs to confirm the tunnel connected:

```bash
sudo docker compose logs frpc
```

You're looking for something like:

```
[I] [proxy_manager.go] proxy added: [llamagate]
[I] [control.go] [llamagate] login to server success
```

If you see `token mismatch` or `i/o timeout` instead, just re-run the deploy script — it'll re-sync your `.env` with the VPS.
```bash
python start.py
```

```bash
python3 getkey.py
```

`getkey.py` reads your encrypted database and prints your key:

```
API key: <your-64-char-hex-key>
```

Keep this somewhere safe. You'll need it for every `/v1/*` request.
```bash
API_KEY="your-api-key"
curl https://<domain>/health
curl https://<domain>/v1/models -H "Authorization: Bearer $API_KEY"
```

Or run the full test suite if you want to be thorough:

```bash
LLAMAGATE_URL="https://<domain>" \
LLAMAGATE_API_KEY="your-api-key" \
python3 test_openai.py
```

```
curl → HTTPS:443 → Caddy (TLS) → frps:8000 → TLS tunnel → frpc → server:8000 → GPU
       ✅ encrypted               ✅ encrypted                        │
                                                                     ▼
                                                              snkv database
                                                  ✅ XChaCha20-Poly1305 encrypted
                                                  (keys, audit, usage, rate limits)
```
- Caddy handles HTTPS and auto-renews your Let's Encrypt cert so you never have to think about it
- frps lives on your VM and receives the tunnel connection from your local machine
- frpc runs locally and forwards traffic from the VM to your local server
- The FRP tunnel itself is TLS-encrypted end-to-end
- snkv is an embedded encrypted database — every value at rest (API keys, audit log, usage stats, rate limit counters) is encrypted with XChaCha20-Poly1305, with the password stretched via Argon2id
This is where you tell LlmVault which models to use:
```yaml
preload: false
models:
  - id: llama3.2-3b
    path: /models/Llama-3.2-3B-Instruct-Q4_K_M.gguf
    n_ctx: 8192
    # n_gpu_layers: -1  # all layers on GPU (default if GPU is detected)
    # n_gpu_layers: 0   # force CPU-only
```

Need embeddings? Use `--add-model` to download any additional model without touching your existing config:
```bash
python3 start.py --add-model
```

Pick from the menu — option 6 is Nomic Embed, which comes pre-configured as an embedding model and is tiny (~90 MB):

```
1. Llama 3.2 3B ...
2. Llama 3.1 8B ...
...
6. Nomic Embed Text v1.5 (Q4_K_M, ~90 MB) — embedding model
7. Custom (enter HuggingFace repo + filename)
```
Then restart the server. Or just add it manually to `models.yaml`:

```yaml
preload: false
models:
  - id: llama3.2-3b
    path: /models/Llama-3.2-3B-Instruct-Q4_K_M.gguf
    n_ctx: 8192
  - id: nomic-embed
    path: /models/nomic-embed-text-v1.5.Q4_K_M.gguf
    n_ctx: 2048
    embedding_only: true
```

Then call it like any embeddings endpoint:
```bash
curl https://<domain>/v1/embeddings \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model":"nomic-embed","input":["Hello world","LlamaGate"]}'
```

Every `/v1/*` endpoint requires `Authorization: Bearer <key>`.
```bash
# ── Health check (no auth needed) ─────────────────────────────────────────────
curl https://<domain>/health
# {"status":"ok","gpu":"cuda","loaded":["llama3.2-3b","nomic-embed"]}

# ── Invalid key returns 401, as expected ──────────────────────────────────────
curl https://<domain>/v1/models -H "Authorization: Bearer badkey"
# {"detail":"Invalid or missing API key"}

# ── List models ───────────────────────────────────────────────────────────────
curl https://<domain>/v1/models -H "Authorization: Bearer $API_KEY"

# ── Chat completion ───────────────────────────────────────────────────────────
curl https://<domain>/v1/chat/completions \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model":"llama3.2-3b","messages":[{"role":"user","content":"Hello"}],"max_tokens":100}'

# ── Chat completion (streaming) ───────────────────────────────────────────────
curl https://<domain>/v1/chat/completions \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model":"llama3.2-3b","messages":[{"role":"user","content":"Count to 5"}],"max_tokens":50,"stream":true}'

# ── Text completion ───────────────────────────────────────────────────────────
curl https://<domain>/v1/completions \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model":"llama3.2-3b","prompt":"The capital of France is","max_tokens":10}'

# ── Text completion (streaming) ───────────────────────────────────────────────
curl https://<domain>/v1/completions \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model":"llama3.2-3b","prompt":"1 + 1 =","max_tokens":5,"stream":true}'

# ── Embeddings ────────────────────────────────────────────────────────────────
curl https://<domain>/v1/embeddings \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model":"nomic-embed","input":["Hello world","LlamaGate"]}'

# ── Usage — daily token counts for your key ───────────────────────────────────
curl "https://<domain>/v1/usage?days=7" -H "Authorization: Bearer $API_KEY"

# ── Audit log — last N requests across all keys ───────────────────────────────
curl "https://<domain>/v1/audit?n=10" -H "Authorization: Bearer $API_KEY"

# ── Rate limit in action — 429 when you exceed the limit ──────────────────────
# (only applies if you set RATE_LIMIT in .env, e.g. RATE_LIMIT=5 RATE_WINDOW=60)
curl https://<domain>/v1/models -H "Authorization: Bearer $API_KEY"
# {"detail":"Rate limit exceeded"}
```

Since LlmVault speaks the OpenAI API dialect, you can point the official SDK straight at it:
```python
from openai import OpenAI

client = OpenAI(
    base_url="https://<domain>/v1",
    api_key="your-api-key",
)

response = client.chat.completions.create(
    model="llama3.2-3b",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
```

That's it. No other changes needed.
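If you'd rather consume a stream raw (from curl, httpx, etc.) instead of through the SDK: each streaming response is a series of server-sent events, each a `data:` line carrying a JSON chunk, ending with `data: [DONE]`. A minimal stdlib parser — the sample chunks below are hand-written in the OpenAI chunk shape, not captured output:

```python
import json

def collect_stream(lines):
    """Concatenate assistant text out of OpenAI-style SSE 'data:' lines."""
    text = []
    for line in lines:
        if not line.startswith("data: "):
            continue                      # skip blanks / keep-alive comments
        payload = line[len("data: "):]
        if payload.strip() == "[DONE]":
            break                         # end-of-stream sentinel
        chunk = json.loads(payload)
        delta = chunk["choices"][0]["delta"].get("content")
        if delta:
            text.append(delta)
    return "".join(text)

# Hand-written sample in the OpenAI chunk shape:
sample = [
    'data: {"choices":[{"delta":{"role":"assistant"}}]}',
    'data: {"choices":[{"delta":{"content":"Hel"}}]}',
    'data: {"choices":[{"delta":{"content":"lo"}}]}',
    "data: [DONE]",
]
print(collect_stream(sample))  # Hello
```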
| Feature | Supported |
|---|---|
| `POST /v1/chat/completions` | ✅ |
| `POST /v1/completions` | ✅ |
| `POST /v1/embeddings` | ✅ |
| `GET /v1/models` | ✅ |
| `stream=True` (all endpoints) | ✅ |
| `temperature`, `top_p`, `stop`, `max_tokens` | ✅ |
| Multi-turn conversations | ✅ |
| System / user / assistant roles | ✅ |
| `POST /v1/images/generations` | ❌ local models only |
| `POST /v1/audio/transcriptions` | ❌ local models only |
| `POST /v1/audio/speech` | ❌ local models only |
| `POST /v1/fine_tuning/jobs` | ❌ cloud-only |
| `POST /v1/assistants` | ❌ cloud-only |
| `POST /v1/files` | ❌ cloud-only |
| `POST /v1/moderations` | ❌ cloud-only |
| `logprobs`, `n`, `presence_penalty`, `frequency_penalty` | ❌ not yet implemented |
| Variable | Default | What it does |
|---|---|---|
| `SNKV_PASSWORD` | — | Encrypts the database. Auto-generated by `--init` — don't lose it. |
| `SNKV_DB` | `/data/llamagate.db`, or `~/.llamagate/llamagate.db` if `/data` is not present | Where the database lives |
| `RATE_LIMIT` | `0` | Max requests per window per key. `0` means disabled. |
| `RATE_WINDOW` | `60` | Rate limit window in seconds |
| `FRP_SERVER_ADDR` | — | Your VM's public IP (set automatically by the deploy script) |
| `FRP_TOKEN` | — | Tunnel auth secret. Auto-generated by `--init`. |
| `DOMAIN` | — | Your public domain for HTTPS |
| `ACME_EMAIL` | — | Email address for Let's Encrypt notifications |
Things go wrong sometimes. Here's what to do:
| What you're seeing | What to do |
|---|---|
| `docker: permission denied` | `sudo usermod -aG docker $USER && newgrp docker` |
| `token mismatch` in frpc/frps logs | Re-run `./scripts/deploy_vps.sh` — your `.env` got out of sync |
| `i/o timeout` on port 7000 | Open port 7000 in your cloud firewall / security group |
| `connection refused` on port 8000 | Open port 8000 in your cloud firewall / security group |
| Caddy cert times out | Make sure ports 80 and 443 are open |
| VM shuts down out of nowhere | Check your cloud provider's auto-shutdown / scheduled action settings |
| `proxy already exists` in frpc logs | `sudo docker compose down && sudo docker compose up -d` |
| `connection refused` on 127.0.0.1:8000 | `python start.py` isn't running — start the server first |
```
llamagate/
├── server/
│   ├── main.py             # FastAPI app — routes, auth, streaming
│   ├── store.py            # snkv: keys, rate limits, audit, usage
│   ├── config.py           # Settings + snkv init
│   ├── requirements.txt
│   └── Dockerfile
├── frp/
│   ├── frpc.toml           # FRP client config (runs on your local machine)
│   └── frps.toml           # FRP server config (runs on the VM)
├── scripts/
│   ├── deploy_vps.sh       # One-command VPS deploy for Linux / macOS
│   └── deploy_vps.ps1      # One-command VPS deploy for Windows PowerShell
├── start.py                # Native launcher + model downloader
├── models.yaml             # Which models to load
├── Caddyfile.vps           # Caddy config for the VM
├── docker-compose.yml      # Local setup (frpc only)
├── docker-compose.vps.yml  # VM setup (frps + caddy)
└── .env                    # Your secrets (auto-generated by --init)
```
LlmVault is not affiliated with OpenAI. It provides an OpenAI-compatible interface for local models, but uses none of OpenAI's infrastructure or models.
You're in charge of how you expose it. If you're making this public, please use authentication and rate limiting. Don't leave an open API server on the internet.
Model quality varies. Everything runs locally using third-party GGUF models. Outputs can be inaccurate or inconsistent depending on the model. Check the license of any model you download.
Security is taken seriously here, but no system is bulletproof. Use good practices: proper firewall rules, strong secrets, and don't commit your .env file anywhere.
This software comes as-is. No warranties, no guarantees. Use it, break it, improve it.
Apache License 2.0 © 2025 Hash Anu