Skip to content
mtecnic edited this page May 28, 2026 · 2 revisions

Model Chat CLI — Wiki

╔════════════════════════════════════════════════════════════════════════╗
║   A terminal command center for local AI servers                       ║
║   Discover · Chat · Benchmark · Battle · Probe agentic capability      ║
╚════════════════════════════════════════════════════════════════════════╝

Welcome. This wiki is the deep-dive companion to the README. The README answers what the tool does — the wiki answers how it works, why it was built that way, and how to bend it to your setup.


Start here

If you want to… Go to
Install and run for the first time Installation · Quick-Start
Understand the network scan Discovery
Read the streaming / TPS / TTFT internals Chat
Compare models head-to-head Arena
Find the best system prompt for your task Prompt-Arena
Hammer a server with load Stress-Testing
Score tool-calling capability Tool-Calling-Benchmark
See how the pieces fit together Architecture
Look up a function or class API-Reference
Fix something that broke Troubleshooting
Add a feature or a new server backend Contributing

Tier-zero facts

  • No telemetry. Nothing leaves your machine except the requests you make to your own servers.
  • No cloud calls. Discovery, chat, arena, stress tests, and the tool bench all run against models you host.
  • OpenAI + Ollama protocols. Anything that speaks /v1/chat/completions or /api/chat works.
  • One config file. ~/.model_chat_cache.json — that's it.

Five things people miss on first read

  1. The chat TPS number is decode-only. The timer starts on the first token, so the reported t/s reflects actual generation speed, not wall-clock with TTFT mixed in. See Chat#performance-metrics.
  2. The tool bench has six tiers, not one. Quick (7) is a smoke test. EXTREME (8) includes prompt-injection resistance and conflicting tool sources. See Tool-Calling-Benchmark#difficulty-tiers.
  3. Realistic-user mode uses Poisson arrivals, not a fixed RPS. It models bursty real traffic with multi-turn growing context. See Stress-Testing#realistic-user.
  4. Arena tournament judging is suite-specific. Coding prompts are scored on correctness, not "creativity" or "tone". See Arena#tournament-judging.
  5. Discovery caches healthy servers. First scan is slow, every subsequent launch is instant — re-validation runs in parallel against the cache. See Discovery#caching.

License & origin

MIT. Built by @mtecnic for people who actually run local models and want to know what they're doing.

Clone this wiki locally