LlmVault

So you want to run your own LLaMA models locally and expose them through a clean, OpenAI-compatible API? That's exactly what LlmVault does — and it stays out of your way while doing it.

It's minimal, self-hosted, and designed to work on Windows, macOS, and Linux — natively, with an optional FRP tunnel if you want to reach it from anywhere.


What it does

  • POST /v1/chat/completions — multi-turn chat with streaming
  • POST /v1/completions — classic text completion, also streaming
  • POST /v1/embeddings — vector embeddings for your semantic search needs
  • GET /v1/models — see what models are loaded
  • GET /v1/usage?days=7 — daily token usage, broken down by API key
  • GET /v1/audit?n=50 — a full audit log of who asked what
  • GET /health — a quick pulse check (no auth needed)

Under the hood, API keys are stored in an encrypted database using XChaCha20-Poly1305 with Argon2id key derivation — so your secrets actually stay secret. You also get per-key rate limiting, GPU auto-detection across CUDA / Metal / ROCm / SYCL / Vulkan / CPU, and an FRP tunnel for public access without ever poking a hole in your router.


Getting started

Step 1 — One-time setup (your local machine)

python start.py --init

This runs an interactive setup that generates your .env with all the secrets it needs, and walks you through downloading a model. You only need to do this once.


Step 2 — Spin up a VPS (any public VM)

You'll need an Ubuntu 24.04 VM somewhere — DigitalOcean, Hetzner, AWS, Azure, GCP, whatever you prefer. Open these ports in your firewall or security group:

| Port | Protocol | What it's for |
|------|----------|---------------|
| 22   | TCP      | SSH access |
| 80   | TCP      | Let's Encrypt ACME challenge |
| 443  | Any      | HTTPS |
| 7000 | TCP      | FRP tunnel |
| 8000 | TCP      | API (TCP proxy) |

Then deploy everything with a single script:

Linux / macOS:

chmod 400 linux_key.pem
./scripts/deploy_vps.sh linux_key.pem <vm-public-ip>

Windows (PowerShell):

.\scripts\deploy_vps.ps1 -Pem linux_key.pem -IP <vm-public-ip>

It'll ask you two questions — your domain (the default <ip>.nip.io works great and needs zero registration) and an email for Let's Encrypt. Everything else happens automatically: Docker gets installed on the VM, frps and caddy start up, and your .env gets updated with the server address.


Step 3 — Start the tunnel (local machine)

sudo docker compose up -d

Give it a second, then check the logs to confirm the tunnel connected:

sudo docker compose logs frpc

You're looking for something like:

[I] [proxy_manager.go] proxy added: [llamagate]
[I] [control.go] [llamagate] login to server success

If you see token mismatch or i/o timeout instead, just re-run the deploy script — it'll re-sync your .env with the VPS.


Step 4 — Start the server (local machine)

python start.py

Step 5 — Get your API key

python3 getkey.py

This reads your encrypted database and prints your key:

API key: <your-64-char-hex-key>

Keep this somewhere safe. You'll need it for every /v1/* request.
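For reference, a 64-character hex string is just 32 random bytes. Keys of the same shape can be produced with the standard library (purely illustrative; your real key comes out of the encrypted database via getkey.py):

```python
import secrets

# 32 random bytes rendered as 64 hex characters, the same shape as a LlmVault key
api_key = secrets.token_hex(32)
print(len(api_key))  # 64
```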


Step 6 — Make sure it's working

API_KEY="your-api-key"

curl https://<domain>/health
curl https://<domain>/v1/models -H "Authorization: Bearer $API_KEY"

Or run the full test suite if you want to be thorough:

LLAMAGATE_URL="https://<domain>" \
LLAMAGATE_API_KEY="your-api-key" \
python3 test_openai.py

How the pieces fit together

curl → HTTPS:443 → Caddy (TLS) → frps:8000 → TLS tunnel → frpc → server:8000 → GPU
        ✅ encrypted                            ✅ encrypted              │
                                                                          ▼
                                                                   snkv database
                                                                   ✅ XChaCha20-Poly1305 encrypted
                                                              (keys, audit, usage, rate limits)
  • Caddy handles HTTPS and auto-renews your Let's Encrypt cert so you never have to think about it
  • frps lives on your VM and receives the tunnel connection from your local machine
  • frpc runs locally and forwards traffic from the VM to your local server
  • The FRP tunnel itself is TLS-encrypted end-to-end
  • snkv is an embedded encrypted database — every value at rest (API keys, audit log, usage stats, rate limit counters) is encrypted with XChaCha20-Poly1305, with the password stretched via Argon2id

models.yaml

This is where you tell LlmVault which models to use:

preload: false

models:
  - id: llama3.2-3b
    path: /models/Llama-3.2-3B-Instruct-Q4_K_M.gguf
    n_ctx: 8192
    # n_gpu_layers: -1   # all layers on GPU (default if GPU is detected)
    # n_gpu_layers: 0    # force CPU-only

Adding an embedding model

Need embeddings? Use --add-model to download any additional model without touching your existing config:

python3 start.py --add-model

Pick from the menu — option 6 is Nomic Embed, which comes pre-configured as an embedding model and is tiny (~90 MB):

1. Llama 3.2 3B  ...
2. Llama 3.1 8B  ...
...
6. Nomic Embed Text v1.5  (Q4_K_M, ~90 MB)  — embedding model
7. Custom (enter HuggingFace repo + filename)

Then restart the server. Or just add it manually to models.yaml:

preload: false

models:
  - id: llama3.2-3b
    path: /models/Llama-3.2-3B-Instruct-Q4_K_M.gguf
    n_ctx: 8192

  - id: nomic-embed
    path: /models/nomic-embed-text-v1.5.Q4_K_M.gguf
    n_ctx: 2048
    embedding_only: true

Then call it like any embeddings endpoint:

curl https://<domain>/v1/embeddings \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model":"nomic-embed","input":["Hello world","LlamaGate"]}'
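The response carries one vector per input string. For semantic search you typically rank documents by cosine similarity against a query vector; a stdlib sketch (the short vectors here are made up stand-ins for the "embedding" fields of the response):

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


# Toy vectors standing in for real embeddings
v1 = [0.1, 0.9, 0.2]
v2 = [0.1, 0.8, 0.3]
print(round(cosine_similarity(v1, v2), 3))
```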

API reference

Every /v1/* endpoint requires Authorization: Bearer <key>.

# ── Health check (no auth needed) ─────────────────────────────────────────────
curl https://<domain>/health
# {"status":"ok","gpu":"cuda","loaded":["llama3.2-3b","nomic-embed"]}

# ── Invalid key returns 401, as expected ──────────────────────────────────────
curl https://<domain>/v1/models -H "Authorization: Bearer badkey"
# {"detail":"Invalid or missing API key"}

# ── List models ───────────────────────────────────────────────────────────────
curl https://<domain>/v1/models -H "Authorization: Bearer $API_KEY"

# ── Chat completion ───────────────────────────────────────────────────────────
curl https://<domain>/v1/chat/completions \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model":"llama3.2-3b","messages":[{"role":"user","content":"Hello"}],"max_tokens":100}'

# ── Chat completion (streaming) ───────────────────────────────────────────────
curl https://<domain>/v1/chat/completions \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model":"llama3.2-3b","messages":[{"role":"user","content":"Count to 5"}],"max_tokens":50,"stream":true}'

# ── Text completion ───────────────────────────────────────────────────────────
curl https://<domain>/v1/completions \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model":"llama3.2-3b","prompt":"The capital of France is","max_tokens":10}'

# ── Text completion (streaming) ───────────────────────────────────────────────
curl https://<domain>/v1/completions \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model":"llama3.2-3b","prompt":"1 + 1 =","max_tokens":5,"stream":true}'

# ── Embeddings ────────────────────────────────────────────────────────────────
curl https://<domain>/v1/embeddings \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model":"nomic-embed","input":["Hello world","LlamaGate"]}'

# ── Usage — daily token counts for your key ───────────────────────────────────
curl "https://<domain>/v1/usage?days=7" -H "Authorization: Bearer $API_KEY"

# ── Audit log — last N requests across all keys ───────────────────────────────
curl "https://<domain>/v1/audit?n=10" -H "Authorization: Bearer $API_KEY"

# ── Rate limit in action — 429 when you exceed the limit ─────────────────────
# (only applies if you set RATE_LIMIT in .env, e.g. RATE_LIMIT=5 RATE_WINDOW=60)
curl https://<domain>/v1/models -H "Authorization: Bearer $API_KEY"
# {"detail":"Rate limit exceeded"}
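With stream=true the server answers with Server-Sent Events: one data: line per chunk, terminated by data: [DONE], as in the standard OpenAI streaming format. A stdlib sketch of assembling the deltas back into text (the chunk payloads below are illustrative):

```python
import json


def collect_stream(lines):
    """Assemble streamed chat-completion deltas into the full reply text."""
    parts = []
    for line in lines:
        line = line.strip()
        if not line.startswith("data: "):
            continue  # skip blank keep-alive lines
        payload = line[len("data: "):]
        if payload == "[DONE]":  # end-of-stream sentinel
            break
        chunk = json.loads(payload)
        delta = chunk["choices"][0].get("delta", {})
        parts.append(delta.get("content", ""))
    return "".join(parts)


# Example SSE lines as they might arrive over the wire
raw = [
    'data: {"choices":[{"delta":{"role":"assistant"}}]}',
    'data: {"choices":[{"delta":{"content":"Hel"}}]}',
    'data: {"choices":[{"delta":{"content":"lo"}}]}',
    "data: [DONE]",
]
print(collect_stream(raw))  # Hello
```

In practice the official SDK does this parsing for you; the sketch is just to show what's on the wire.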

Drop-in OpenAI replacement

Since LlmVault speaks the OpenAI API dialect, you can point the official SDK straight at it:

from openai import OpenAI

client = OpenAI(
    base_url="https://<domain>/v1",
    api_key="your-api-key",
)

response = client.chat.completions.create(
    model="llama3.2-3b",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)

That's it. No other changes needed.

What's supported

| Feature | Supported |
|---------|-----------|
| POST /v1/chat/completions | ✅ |
| POST /v1/completions | ✅ |
| POST /v1/embeddings | ✅ |
| GET /v1/models | ✅ |
| stream=True (all endpoints) | ✅ |
| temperature, top_p, stop, max_tokens | ✅ |
| Multi-turn conversations | ✅ |
| System / user / assistant roles | ✅ |
| POST /v1/images/generations | ❌ local models only |
| POST /v1/audio/transcriptions | ❌ local models only |
| POST /v1/audio/speech | ❌ local models only |
| POST /v1/fine_tuning/jobs | ❌ cloud-only |
| POST /v1/assistants | ❌ cloud-only |
| POST /v1/files | ❌ cloud-only |
| POST /v1/moderations | ❌ cloud-only |
| logprobs, n, presence_penalty, frequency_penalty | ❌ not yet implemented |

Environment variables

| Variable | Default | What it does |
|----------|---------|--------------|
| SNKV_PASSWORD | auto-generated by --init | Encrypts the database — don't lose it |
| SNKV_DB | /data/llamagate.db, or ~/.llamagate/llamagate.db if /data is absent | Where the database lives |
| RATE_LIMIT | 0 | Max requests per window per key; 0 means disabled |
| RATE_WINDOW | 60 | Rate limit window in seconds |
| FRP_SERVER_ADDR | set by the deploy script | Your VM's public IP |
| FRP_TOKEN | auto-generated by --init | Tunnel auth secret |
| DOMAIN | — | Your public domain for HTTPS |
| ACME_EMAIL | — | Email address for Let's Encrypt notifications |
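In Python these are read with the usual os.getenv pattern; for example, the two rate-limit knobs with their documented defaults:

```python
import os

# 0 disables rate limiting entirely (the documented default)
rate_limit = int(os.getenv("RATE_LIMIT", "0"))
rate_window = int(os.getenv("RATE_WINDOW", "60"))
print(rate_limit, rate_window)
```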

Troubleshooting

Things go wrong sometimes. Here's what to do:

| What you're seeing | What to do |
|--------------------|------------|
| docker: permission denied | sudo usermod -aG docker $USER && newgrp docker |
| token mismatch in frpc/frps logs | Re-run ./scripts/deploy_vps.sh — your .env got out of sync |
| i/o timeout on port 7000 | Open port 7000 in your cloud firewall / security group |
| connection refused on port 8000 | Open port 8000 in your cloud firewall / security group |
| Caddy cert times out | Make sure ports 80 and 443 are open |
| VM shuts down out of nowhere | Check your cloud provider's auto-shutdown / scheduled action settings |
| proxy already exists in frpc logs | sudo docker compose down && sudo docker compose up -d |
| connection refused on 127.0.0.1:8000 | python start.py isn't running — start the server first |

Project layout

llamagate/
├── server/
│   ├── main.py               # FastAPI app — routes, auth, streaming
│   ├── store.py              # snkv: keys, rate limits, audit, usage
│   ├── config.py             # Settings + snkv init
│   ├── requirements.txt
│   └── Dockerfile
├── frp/
│   ├── frpc.toml             # FRP client config (runs on your local machine)
│   └── frps.toml             # FRP server config (runs on the VM)
├── scripts/
│   ├── deploy_vps.sh         # One-command VPS deploy for Linux / macOS
│   └── deploy_vps.ps1        # One-command VPS deploy for Windows PowerShell
├── start.py                  # Native launcher + model downloader
├── models.yaml               # Which models to load
├── Caddyfile.vps             # Caddy config for the VM
├── docker-compose.yml        # Local setup (frpc only)
├── docker-compose.vps.yml    # VM setup (frps + caddy)
└── .env                      # Your secrets (auto-generated by --init)

A few things worth knowing

LlmVault is not affiliated with OpenAI. It provides an OpenAI-compatible interface for local models, but uses none of OpenAI's infrastructure or models.

You're in charge of how you expose it. If you're making this public, please use authentication and rate limiting. Don't leave an open API server on the internet.

Model quality varies. Everything runs locally using third-party GGUF models. Outputs can be inaccurate or inconsistent depending on the model. Check the license of any model you download.

Security is taken seriously here, but no system is bulletproof. Use good practices: proper firewall rules, strong secrets, and don't commit your .env file anywhere.

This software comes as-is. No warranties, no guarantees. Use it, break it, improve it.


License

Apache License 2.0 © 2025 Hash Anu

About

Self-host any GGUF model on your local machine and expose it securely as an OpenAI-compatible API over a public HTTPS URL.
