docs_rag_app is a small companion project for the Practical Tech Guide series
book Building RAG Applications.
The goal is to provide a narrow, readable starter project that can grow with the book:
- load local Markdown and text documents
- split them into chunks
- retrieve relevant chunks for a question
- assemble a grounded answer with citations
The current scaffold starts with lexical retrieval, includes a small embedding-backed vector path, and adds a lightweight reranking step so the project can compare multiple retrieval stages without introducing heavy dependencies too early.
The bundled sample corpus is a small fictional support-team handbook, not book content. It is original example content written for this project, and it gives the app a realistic mix of Markdown and plain-text documents to search over during the examples below.
The starter project is split into a few small modules:
loaders.pyload and normalize local documentschunking.pysplit documents into small retrievable chunksretrieval.pyrank chunks for a questiongeneration.pyturn retrieved chunks into a grounded answerservice.pycache the ingested corpus and refresh it when files changeindex_store.pypersist the local index so later queries can reuse itpipeline.pyrun a simple retrieve-and-answer flowcli.pyprovide a small local command-line interfaceweb.pyprovide a small local API and browser query UI
Clone the repo and set up the local environment:
git clone https://github.com/nextframedev/docs_rag_app.git
cd docs_rag_app
python3 -m venv .venv
source .venv/bin/activate
pip install -e '.[dev]'
pytest
python -m docs_rag_app.cli \
--corpus examples/sample_corpus \
--question "Who should publish external updates during a sev1 incident?"Try the starter vector path:
python -m docs_rag_app.cli \
--corpus examples/sample_corpus \
--retrieval-mode vector \
--question "Who should publish external updates during a sev1 incident?"Try the lightweight reranker:
python -m docs_rag_app.cli \
--corpus examples/sample_corpus \
--retrieval-mode lexical \
--reranking keyword \
--question "Who should publish external updates during a sev1 incident?" \
--jsonTry the grounded extractive answer mode explicitly:
python -m docs_rag_app.cli \
--corpus examples/sample_corpus \
--answer-mode extractive \
--question "What must be removed before sharing screenshots outside the company?" \
--jsonTry an OpenAI-compatible answer step:
python -m docs_rag_app.cli \
--corpus examples/sample_corpus \
--answer-mode openai-compatible \
--llm-endpoint http://127.0.0.1:11434/v1/chat/completions \
--llm-model llama3.2 \
--question "How should support phrase a reply when timing is uncertain?" \
--jsonCompare extractive and OpenAI-compatible answer evaluation side by side:
python -m docs_rag_app.cli \
--corpus examples/sample_corpus \
--answer-eval-set examples/answer_eval_set.json \
--compare-answer-modes \
--answer-mode openai-compatible \
--llm-endpoint http://127.0.0.1:11434/v1/chat/completions \
--llm-model llama3.2 \
--jsonBuild or refresh the persisted local index:
python -m docs_rag_app.cli \
--corpus examples/sample_corpus \
--build-indexRun the local API and browser UI:
python -m docs_rag_app.web --corpus examples/sample_corpusThen open http://127.0.0.1:8000.
Keep the default loopback host unless you have a specific reason to expose the local web app on your network. The browser UI and JSON API can accept:
- evaluation-set paths for local answer and retrieval checks
- OpenAI-compatible LLM endpoints for model-backed answers
Those are appropriate for local development, but they should be treated more carefully if you bind the app to a non-loopback host.
For structured output:
python -m docs_rag_app.cli \
--corpus examples/sample_corpus \
--question "How should a new support engineer request tool access?" \
--jsonTo write a small ingestion manifest:
python -m docs_rag_app.cli \
--corpus examples/sample_corpus \
--question "What should a shift handoff note include?" \
--manifest-out manifest.jsonTo focus on the plain-text handoff notes:
python -m docs_rag_app.cli \
--corpus examples/sample_corpus \
--extension .txt \
--top-k 1 \
--question "What should a shift handoff note include?" \
--jsonTo run a small evaluation set:
python -m docs_rag_app.cli \
--corpus examples/sample_corpus \
--eval-set examples/eval_set.json \
--retrieval-mode lexical \
--jsonTo run answer-quality evaluation:
python -m docs_rag_app.cli \
--corpus examples/sample_corpus \
--answer-eval-set examples/answer_eval_set.json \
--answer-mode extractive \
--jsonThe bundled evaluation set includes both easy and intentionally harder cases. Some of the harder cases are meant to expose retrieval weaknesses, not to guarantee a perfect score from the current baseline.
The default scaffold is intentionally simple:
- lexical retrieval uses token overlap
- vector retrieval uses deterministic hashed embeddings and cosine similarity
- reranking uses lightweight query expansion plus title/content overlap boosts
- extractive answers are assembled deterministically from retrieved chunks
- an OpenAI-compatible answer mode can generate a concise grounded answer from context
- the OpenAI-compatible path supports configurable HTTP timeouts
- citations come from source paths and chunk ids
- compact evidence snippets accompany each answer for UI and API consumers
- a small manifest summarizes the current corpus ingestion
- the local service refreshes the corpus when supported files change
- a local JSON index stores chunks, manifest data, and file signatures
- the local web app exposes JSON endpoints plus a basic query UI
- recent query, refresh, and evaluation activity is stored locally for inspection
That keeps the first version transparent enough for teaching.
- load
.mdand.txtfiles from a local folder - chunk text with configurable size and overlap
- retrieve the top matching chunks for a question
- switch between lexical and vector retrieval
- optionally apply lightweight reranking
- switch between extractive and OpenAI-compatible answer generation
- filter retrieval by source substring or file extension
- return a short grounded answer summary with citations
- return snippet-level evidence metadata alongside citations
- return an inspection summary explaining top source, score, and answer mode
- emit JSON output for downstream inspection
- write a small ingestion manifest as JSON
- run a small retrieval evaluation set
- run a small answer-quality evaluation set
- refresh the in-memory corpus when files are added or changed
- persist a local index and reuse it across restarts
- serve a basic local API for querying and manifest inspection
- serve a small browser UI for local querying
- keep a small recent-activity history for queries, refreshes, and evaluations
This project is meant to evolve in stages:
- lexical retrieval baseline
- embedding and vector retrieval baseline
- metadata filters
- lightweight reranking
- evaluation and debugging workflows
- freshness and re-indexing
- local API and query UI
- richer answer generation
- answer-quality evaluation
Run the tests:
pytestRun linting:
ruff check .Run type checks:
mypy srcThe starter web app exposes a few small endpoints:
GET /api/healthGET /api/historyGET /api/manifestPOST /api/history/clearPOST /api/indexPOST /api/queryPOST /api/refreshPOST /api/evaluate/retrievalPOST /api/evaluate/answers
Example query payload:
{
"question": "What must be removed before sharing screenshots outside the company?",
"retrieval_mode": "lexical",
"answer_mode": "extractive",
"reranking": "keyword",
"top_k": 3,
"extensions": [".md"]
}The LLM is optional in this project.
The flow is:
- the app retrieves grounded context from the local corpus
answer_mode: "extractive"builds a deterministic answer without a model callanswer_mode: "openai-compatible"sends the retrieved context to the configured LLM endpoint
That means the browser UI does not talk to the model directly.
The local web app receives the request, runs retrieval, and only then makes the
LLM call if you selected openai-compatible.
llama3.2 in the examples is just one sample model name.
It shows an optional local setup, such as using Ollama behind an
OpenAI-compatible endpoint.
You can replace it with any model id your chosen endpoint accepts.
The project uses the same answer-generation settings across every surface:
- answer mode
- LLM endpoint
- LLM model
- optional API-key environment variable
The settings stay the same, but each surface passes them differently:
| Surface | Where you set local model config |
|---|---|
| CLI | command flags such as --answer-mode, --llm-endpoint, and --llm-model |
| JSON API | request fields such as answer_mode, llm_endpoint, and llm_model |
| browser UI | form fields named Answer mode, LLM endpoint, and LLM model |
One practical difference is that the CLI config is set when you start the command, while the API and browser UI send LLM settings with each request.
For a local model server, the usual pattern is:
- run a local OpenAI-compatible endpoint such as
http://127.0.0.1:11434/v1/chat/completions - choose
answer_mode/--answer-modeasopenai-compatible - pass the model name your local server expects, such as
llama3.2
CLI example:
python -m docs_rag_app.cli \
--corpus examples/sample_corpus \
--answer-mode openai-compatible \
--llm-endpoint http://127.0.0.1:11434/v1/chat/completions \
--llm-model llama3.2 \
--question "How should support phrase a reply when timing is uncertain?" \
--jsonIn the browser UI:
- run
python -m docs_rag_app.web --corpus examples/sample_corpus - open
http://127.0.0.1:8000 - set
Answer modetoopenai-compatible - fill in
LLM endpoint, for examplehttp://127.0.0.1:11434/v1/chat/completions - fill in
LLM model - run the query
Equivalent API request:
{
"question": "How should support phrase a reply when timing is uncertain?",
"retrieval_mode": "lexical",
"answer_mode": "openai-compatible",
"reranking": "keyword",
"top_k": 3,
"extensions": [".md"],
"llm_endpoint": "http://127.0.0.1:11434/v1/chat/completions",
"llm_model": "llama3.2"
}If your endpoint requires an API key, set OPENAI_API_KEY before starting the
web app. The JSON API also accepts llm_api_key_env if you want the server to
read a different environment variable name.
The same answer_mode, llm_endpoint, and llm_model fields also apply to
POST /api/evaluate/answers when you want to evaluate the model-backed path.
For the web API, eval_set_path is intentionally limited to JSON files inside
the current workspace or the corpus parent directory.
The web UI also includes a Recent Activity panel that shows the latest
queries, refreshes, and evaluation runs stored beside the local index.
The project now supports two evaluation loops:
- retrieval evaluation with
examples/eval_set.json - answer-quality evaluation with
examples/answer_eval_set.json
Those evaluation workflows are available from both:
- the CLI
- the local browser UI and JSON API
The answer-quality evaluation checks whether the app:
- cited the expected source
- abstained when it should
- included required answer phrases
- avoided forbidden answer phrases
By default the app stores its local index under a hidden directory beside the corpus, for example:
examples/.docs_rag_index/sample_corpus/index.json
You can override that location with --index-path in both the CLI and the web
entrypoint.
MIT