So you want to run your own LLaMA models locally and expose them through a clean, OpenAI-compatible API? That's exactly what LlmVault does — and it stays out of your way while doing it.
It's minimal, self-hosted, and designed to work on Windows, macOS, and Linux — natively, with an optional FRP tunnel if you want to reach it from anywhere.
- `POST /v1/chat/completions` — multi-turn chat with streaming
- `POST /v1/completions` — classic text completion, also streaming
- `POST /v1/embeddings` — vector embeddings for your semantic search needs
- `GET /v1/models` — see what models are loaded
- `GET /v1/usage?days=7` — daily token usage, broken down by API key
- `GET /v1/audit?n=50` — a full audit log of who asked what
- `GET /health` — a quick pulse check (no auth needed)
Under the hood, API keys are stored in an encrypted database using XChaCha20-Poly1305 with Argon2id key derivation — so your secrets actually stay secret. You also get per-key rate limiting, GPU auto-detection across CUDA / Metal / ROCm / SYCL / Vulkan / CPU, and an FRP tunnel for public access without ever poking a hole in your router.
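To give a feel for what GPU auto-detection means in practice, here is a rough, stdlib-only illustration of backend probing. This is NOT LlmVault's actual detection code — just the general idea of checking for vendor tooling and platform hints (SYCL probing omitted for brevity):

```python
import platform
import shutil

def guess_backend() -> str:
    """Rough illustration of backend probing -- NOT LlmVault's actual logic."""
    if shutil.which("nvidia-smi"):      # NVIDIA driver tooling present
        return "cuda"
    if shutil.which("rocm-smi"):        # AMD ROCm tooling present
        return "rocm"
    if platform.system() == "Darwin" and platform.machine() == "arm64":
        return "metal"                  # Apple Silicon
    if shutil.which("vulkaninfo"):      # Vulkan loader tooling present
        return "vulkan"
    return "cpu"                        # safe fallback

print(guess_backend())
```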
```bash
python start.py --init
```

This runs an interactive setup that generates your `.env` with all the secrets it needs, and walks you through downloading a model. You only need to do this once.
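For reference, the generated `.env` ends up looking roughly like this — the variable names match the configuration table later in this README, and every value here is a placeholder (yours are generated for you):

```
SNKV_PASSWORD=<generated>
FRP_TOKEN=<generated>
FRP_SERVER_ADDR=<vm-public-ip>
DOMAIN=<ip>.nip.io
ACME_EMAIL=you@example.com
RATE_LIMIT=0
RATE_WINDOW=60
```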
You'll need an Ubuntu 24.04 VM somewhere — DigitalOcean, Hetzner, AWS, Azure, GCP, whatever you prefer. Open these ports in your firewall or security group:
| Port | Protocol | What it's for |
|---|---|---|
| 22 | TCP | SSH access |
| 80 | TCP | Let's Encrypt ACME challenge |
| 443 | TCP + UDP | HTTPS (UDP for HTTP/3) |
| 7000 | TCP | FRP tunnel |
| 8000 | TCP | API (TCP proxy) |
Then deploy everything with a single script:
Linux / macOS:
```bash
chmod 400 linux_key.pem
./scripts/deploy_vps.sh linux_key.pem <vm-public-ip>
```

Windows (PowerShell):

```powershell
.\scripts\deploy_vps.ps1 -Pem linux_key.pem -IP <vm-public-ip>
```

It'll ask you two questions — your domain (the default `<ip>.nip.io` works great and needs zero registration) and an email for Let's Encrypt. Everything else happens automatically: Docker gets installed on the VM, frps and caddy start up, and your `.env` gets updated with the server address.
```bash
sudo docker compose up -d
```

Give it a second, then check the logs to confirm the tunnel connected:

```bash
sudo docker compose logs frpc
```

You're looking for something like:

```
[I] [proxy_manager.go] proxy added: [llamagate]
[I] [control.go] [llamagate] login to server success
```

If you see `token mismatch` or `i/o timeout` instead, just re-run the deploy script — it'll re-sync your `.env` with the VPS.
```bash
python start.py
```

```bash
python3 getkey.py
```

`getkey.py` reads your encrypted database and prints your key:

```
API key: <your-64-char-hex-key>
```

Keep this somewhere safe. You'll need it for every `/v1/*` request.
```bash
API_KEY="your-api-key"
curl https://<domain>/health
curl https://<domain>/v1/models -H "Authorization: Bearer $API_KEY"
```

Or run the full test suite if you want to be thorough:

```bash
LLAMAGATE_URL="https://<domain>" \
LLAMAGATE_API_KEY="your-api-key" \
python3 test_openai.py
```

```
curl → HTTPS:443 → Caddy (TLS) → frps:8000 → TLS tunnel → frpc → server:8000 → GPU
       ✅ encrypted               ✅ encrypted                        │
                                                                     ▼
                                                              snkv database
                                                  ✅ XChaCha20-Poly1305 encrypted
                                                  (keys, audit, usage, rate limits)
```
- Caddy handles HTTPS and auto-renews your Let's Encrypt cert so you never have to think about it
- frps lives on your VM and receives the tunnel connection from your local machine
- frpc runs locally and forwards traffic from the VM to your local server
- The FRP tunnel itself is TLS-encrypted end-to-end
- snkv is an embedded encrypted database — every value at rest (API keys, audit log, usage stats, rate limit counters) is encrypted with XChaCha20-Poly1305, with the password stretched via Argon2id
This is where you tell LlmVault which models to use:
```yaml
preload: false
models:
  - id: llama3.2-3b
    path: /models/Llama-3.2-3B-Instruct-Q4_K_M.gguf
    n_ctx: 8192
    # n_gpu_layers: -1  # all layers on GPU (default if GPU is detected)
    # n_gpu_layers: 0   # force CPU-only
```

Need embeddings? Use `--add-model` to download any additional model without touching your existing config:
```bash
python3 start.py --add-model
```

Pick from the menu — option 6 is Nomic Embed, which comes pre-configured as an embedding model and is tiny (~90 MB):

```
1. Llama 3.2 3B ...
2. Llama 3.1 8B ...
...
6. Nomic Embed Text v1.5 (Q4_K_M, ~90 MB) — embedding model
7. Custom (enter HuggingFace repo + filename)
```
Then restart the server. Or just add it manually to `models.yaml`:

```yaml
preload: false
models:
  - id: llama3.2-3b
    path: /models/Llama-3.2-3B-Instruct-Q4_K_M.gguf
    n_ctx: 8192
  - id: nomic-embed
    path: /models/nomic-embed-text-v1.5.Q4_K_M.gguf
    n_ctx: 2048
    embedding_only: true
```

Then call it like any embeddings endpoint:
```bash
curl https://<domain>/v1/embeddings \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model":"nomic-embed","input":["Hello world","LlamaGate"]}'
```

Every `/v1/*` endpoint requires `Authorization: Bearer <key>`.
```bash
# ── Health check (no auth needed) ─────────────────────────────────────────────
curl https://<domain>/health
# {"status":"ok","gpu":"cuda","loaded":["llama3.2-3b","nomic-embed"]}

# ── Invalid key returns 401, as expected ──────────────────────────────────────
curl https://<domain>/v1/models -H "Authorization: Bearer badkey"
# {"detail":"Invalid or missing API key"}

# ── List models ───────────────────────────────────────────────────────────────
curl https://<domain>/v1/models -H "Authorization: Bearer $API_KEY"

# ── Chat completion ───────────────────────────────────────────────────────────
curl https://<domain>/v1/chat/completions \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model":"llama3.2-3b","messages":[{"role":"user","content":"Hello"}],"max_tokens":100}'

# ── Chat completion (streaming) ───────────────────────────────────────────────
curl https://<domain>/v1/chat/completions \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model":"llama3.2-3b","messages":[{"role":"user","content":"Count to 5"}],"max_tokens":50,"stream":true}'

# ── Text completion ───────────────────────────────────────────────────────────
curl https://<domain>/v1/completions \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model":"llama3.2-3b","prompt":"The capital of France is","max_tokens":10}'

# ── Text completion (streaming) ───────────────────────────────────────────────
curl https://<domain>/v1/completions \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model":"llama3.2-3b","prompt":"1 + 1 =","max_tokens":5,"stream":true}'

# ── Embeddings ────────────────────────────────────────────────────────────────
curl https://<domain>/v1/embeddings \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model":"nomic-embed","input":["Hello world","LlamaGate"]}'

# ── Usage — daily token counts for your key ───────────────────────────────────
curl "https://<domain>/v1/usage?days=7" -H "Authorization: Bearer $API_KEY"

# ── Audit log — last N requests across all keys ───────────────────────────────
curl "https://<domain>/v1/audit?n=10" -H "Authorization: Bearer $API_KEY"

# ── Rate limit in action — 429 when you exceed the limit ──────────────────────
# (only applies if you set RATE_LIMIT in .env, e.g. RATE_LIMIT=5 RATE_WINDOW=60)
curl https://<domain>/v1/models -H "Authorization: Bearer $API_KEY"
# {"detail":"Rate limit exceeded"}
```

Since LlmVault speaks the OpenAI API dialect, you can point the official SDK straight at it:
```python
from openai import OpenAI

client = OpenAI(
    base_url="https://<domain>/v1",
    api_key="your-api-key",
)

response = client.chat.completions.create(
    model="llama3.2-3b",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
```

That's it. No other changes needed.
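If you'd rather consume a stream raw (from curl, httpx, etc.) instead of through the SDK: each streaming response is a series of server-sent events, each a `data:` line carrying a JSON chunk, ending with `data: [DONE]`. A minimal stdlib parser — the sample chunks below are hand-written in the OpenAI chunk shape, not captured output:

```python
import json

def collect_stream(lines):
    """Concatenate assistant text out of OpenAI-style SSE 'data:' lines."""
    text = []
    for line in lines:
        if not line.startswith("data: "):
            continue                      # skip blanks / keep-alive comments
        payload = line[len("data: "):]
        if payload.strip() == "[DONE]":
            break                         # end-of-stream sentinel
        chunk = json.loads(payload)
        delta = chunk["choices"][0]["delta"].get("content")
        if delta:
            text.append(delta)
    return "".join(text)

# Hand-written sample in the OpenAI chunk shape:
sample = [
    'data: {"choices":[{"delta":{"role":"assistant"}}]}',
    'data: {"choices":[{"delta":{"content":"Hel"}}]}',
    'data: {"choices":[{"delta":{"content":"lo"}}]}',
    "data: [DONE]",
]
print(collect_stream(sample))  # Hello
```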
| Feature | Supported |
|---|---|
| `POST /v1/chat/completions` | ✅ |
| `POST /v1/completions` | ✅ |
| `POST /v1/embeddings` | ✅ |
| `GET /v1/models` | ✅ |
| `stream=True` (all endpoints) | ✅ |
| `temperature`, `top_p`, `stop`, `max_tokens` | ✅ |
| Multi-turn conversations | ✅ |
| System / user / assistant roles | ✅ |
| `POST /v1/images/generations` | ❌ local models only |
| `POST /v1/audio/transcriptions` | ❌ local models only |
| `POST /v1/audio/speech` | ❌ local models only |
| `POST /v1/fine_tuning/jobs` | ❌ cloud-only |
| `POST /v1/assistants` | ❌ cloud-only |
| `POST /v1/files` | ❌ cloud-only |
| `POST /v1/moderations` | ❌ cloud-only |
| `logprobs`, `n`, `presence_penalty`, `frequency_penalty` | ❌ not yet implemented |
| Variable | Default | What it does |
|---|---|---|
| `SNKV_PASSWORD` | — | Encrypts the database. Auto-generated by `--init` — don't lose it. |
| `SNKV_DB` | `/data/llamagate.db`, or `~/.llamagate/llamagate.db` if `/data` is not present | Where the database lives |
| `RATE_LIMIT` | `0` | Max requests per window per key. `0` means disabled. |
| `RATE_WINDOW` | `60` | Rate limit window in seconds |
| `FRP_SERVER_ADDR` | — | Your VM's public IP (set automatically by the deploy script) |
| `FRP_TOKEN` | — | Tunnel auth secret. Auto-generated by `--init`. |
| `DOMAIN` | — | Your public domain for HTTPS |
| `ACME_EMAIL` | — | Email address for Let's Encrypt notifications |
Things go wrong sometimes. Here's what to do:
| What you're seeing | What to do |
|---|---|
| `docker: permission denied` | `sudo usermod -aG docker $USER && newgrp docker` |
| `token mismatch` in frpc/frps logs | Re-run `./scripts/deploy_vps.sh` — your `.env` got out of sync |
| `i/o timeout` on port 7000 | Open port 7000 in your cloud firewall / security group |
| `connection refused` on port 8000 | Open port 8000 in your cloud firewall / security group |
| Caddy cert times out | Make sure ports 80 and 443 are open |
| VM shuts down out of nowhere | Check your cloud provider's auto-shutdown / scheduled action settings |
| `proxy already exists` in frpc logs | `sudo docker compose down && sudo docker compose up -d` |
| `connection refused` on 127.0.0.1:8000 | `python start.py` isn't running — start the server first |
```
llamagate/
├── server/
│   ├── main.py             # FastAPI app — routes, auth, streaming
│   ├── store.py            # snkv: keys, rate limits, audit, usage
│   ├── config.py           # Settings + snkv init
│   ├── requirements.txt
│   └── Dockerfile
├── frp/
│   ├── frpc.toml           # FRP client config (runs on your local machine)
│   └── frps.toml           # FRP server config (runs on the VM)
├── scripts/
│   ├── deploy_vps.sh       # One-command VPS deploy for Linux / macOS
│   └── deploy_vps.ps1      # One-command VPS deploy for Windows PowerShell
├── start.py                # Native launcher + model downloader
├── models.yaml             # Which models to load
├── Caddyfile.vps           # Caddy config for the VM
├── docker-compose.yml      # Local setup (frpc only)
├── docker-compose.vps.yml  # VM setup (frps + caddy)
└── .env                    # Your secrets (auto-generated by --init)
```
LlmVault is not affiliated with OpenAI. It provides an OpenAI-compatible interface for local models, but uses none of OpenAI's infrastructure or models.
You're in charge of how you expose it. If you're making this public, please use authentication and rate limiting. Don't leave an open API server on the internet.
Model quality varies. Everything runs locally using third-party GGUF models. Outputs can be inaccurate or inconsistent depending on the model. Check the license of any model you download.
Security is taken seriously here, but no system is bulletproof. Use good practices: proper firewall rules, strong secrets, and don't commit your .env file anywhere.
This software comes as-is. No warranties, no guarantees. Use it, break it, improve it.
Apache License 2.0 © 2025 Hash Anu