docs_rag_app

docs_rag_app is a small companion project for the Practical Tech Guide series book Building RAG Applications.

The goal is to provide a narrow, readable starter project that can grow with the book:

load local Markdown and text documents
split them into chunks
retrieve relevant chunks for a question
assemble a grounded answer with citations

The current scaffold starts with lexical retrieval, includes a small embedding-backed vector path, and adds a lightweight reranking step so the project can compare multiple retrieval stages without introducing heavy dependencies too early.

The bundled sample corpus is a small fictional support-team handbook, not book content. It is original example content written for this project, and it gives the app a realistic mix of Markdown and plain-text documents to search over during the examples below.

Project Shape

The starter project is split into a few small modules:

loaders.py load and normalize local documents
chunking.py split documents into small retrievable chunks
retrieval.py rank chunks for a question
generation.py turn retrieved chunks into a grounded answer
service.py cache the ingested corpus and refresh it when files change
index_store.py persist the local index so later queries can reuse it
pipeline.py run a simple retrieve-and-answer flow
cli.py provide a small local command-line interface
web.py provide a small local API and browser query UI

Quick Start

Clone the repo and set up the local environment:

git clone https://github.com/nextframedev/docs_rag_app.git
cd docs_rag_app
python3 -m venv .venv
source .venv/bin/activate
pip install -e '.[dev]'
pytest
python -m docs_rag_app.cli \
  --corpus examples/sample_corpus \
  --question "Who should publish external updates during a sev1 incident?"

Try the starter vector path:

python -m docs_rag_app.cli \
  --corpus examples/sample_corpus \
  --retrieval-mode vector \
  --question "Who should publish external updates during a sev1 incident?"

Try the lightweight reranker:

python -m docs_rag_app.cli \
  --corpus examples/sample_corpus \
  --retrieval-mode lexical \
  --reranking keyword \
  --question "Who should publish external updates during a sev1 incident?" \
  --json

Try the grounded extractive answer mode explicitly:

python -m docs_rag_app.cli \
  --corpus examples/sample_corpus \
  --answer-mode extractive \
  --question "What must be removed before sharing screenshots outside the company?" \
  --json

Try an OpenAI-compatible answer step:

python -m docs_rag_app.cli \
  --corpus examples/sample_corpus \
  --answer-mode openai-compatible \
  --llm-endpoint http://127.0.0.1:11434/v1/chat/completions \
  --llm-model llama3.2 \
  --question "How should support phrase a reply when timing is uncertain?" \
  --json

Compare extractive and OpenAI-compatible answer evaluation side by side:

python -m docs_rag_app.cli \
  --corpus examples/sample_corpus \
  --answer-eval-set examples/answer_eval_set.json \
  --compare-answer-modes \
  --answer-mode openai-compatible \
  --llm-endpoint http://127.0.0.1:11434/v1/chat/completions \
  --llm-model llama3.2 \
  --json

Build or refresh the persisted local index:

python -m docs_rag_app.cli \
  --corpus examples/sample_corpus \
  --build-index

Run the local API and browser UI:

python -m docs_rag_app.web --corpus examples/sample_corpus

Then open http://127.0.0.1:8000.

Keep the default loopback host unless you have a specific reason to expose the local web app on your network. The browser UI and JSON API can accept:

evaluation-set paths for local answer and retrieval checks
OpenAI-compatible LLM endpoints for model-backed answers

Those are appropriate for local development, but they should be treated more carefully if you bind the app to a non-loopback host.

For structured output:

python -m docs_rag_app.cli \
  --corpus examples/sample_corpus \
  --question "How should a new support engineer request tool access?" \
  --json

To write a small ingestion manifest:

python -m docs_rag_app.cli \
  --corpus examples/sample_corpus \
  --question "What should a shift handoff note include?" \
  --manifest-out manifest.json

To focus on the plain-text handoff notes:

python -m docs_rag_app.cli \
  --corpus examples/sample_corpus \
  --extension .txt \
  --top-k 1 \
  --question "What should a shift handoff note include?" \
  --json

To run a small evaluation set:

python -m docs_rag_app.cli \
  --corpus examples/sample_corpus \
  --eval-set examples/eval_set.json \
  --retrieval-mode lexical \
  --json

To run answer-quality evaluation:

python -m docs_rag_app.cli \
  --corpus examples/sample_corpus \
  --answer-eval-set examples/answer_eval_set.json \
  --answer-mode extractive \
  --json

The bundled evaluation set includes both easy and intentionally harder cases. Some of the harder cases are meant to expose retrieval weaknesses, not to guarantee a perfect score from the current baseline.

The default scaffold is intentionally simple:

lexical retrieval uses token overlap
vector retrieval uses deterministic hashed embeddings and cosine similarity
reranking uses lightweight query expansion plus title/content overlap boosts
extractive answers are assembled deterministically from retrieved chunks
an OpenAI-compatible answer mode can generate a concise grounded answer from context
the OpenAI-compatible path supports configurable HTTP timeouts
citations come from source paths and chunk ids
compact evidence snippets accompany each answer for UI and API consumers
a small manifest summarizes the current corpus ingestion
the local service refreshes the corpus when supported files change
a local JSON index stores chunks, manifest data, and file signatures
the local web app exposes JSON endpoints plus a basic query UI
recent query, refresh, and evaluation activity is stored locally for inspection

That keeps the first version transparent enough for teaching.

Current Capabilities

load .md and .txt files from a local folder
chunk text with configurable size and overlap
retrieve the top matching chunks for a question
switch between lexical and vector retrieval
optionally apply lightweight reranking
switch between extractive and OpenAI-compatible answer generation
filter retrieval by source substring or file extension
return a short grounded answer summary with citations
return snippet-level evidence metadata alongside citations
return an inspection summary explaining top source, score, and answer mode
emit JSON output for downstream inspection
write a small ingestion manifest as JSON
run a small retrieval evaluation set
run a small answer-quality evaluation set
refresh the in-memory corpus when files are added or changed
persist a local index and reuse it across restarts
serve a basic local API for querying and manifest inspection
serve a small browser UI for local querying
keep a small recent-activity history for queries, refreshes, and evaluations

Intended Growth Path

This project is meant to evolve in stages:

lexical retrieval baseline
embedding and vector retrieval baseline
metadata filters
lightweight reranking
evaluation and debugging workflows
freshness and re-indexing
local API and query UI
richer answer generation
answer-quality evaluation

Development

Run the tests:

pytest

Run linting:

ruff check .

Run type checks:

mypy src

Local API

The starter web app exposes a few small endpoints:

GET /api/health
GET /api/history
GET /api/manifest
POST /api/history/clear
POST /api/index
POST /api/query
POST /api/refresh
POST /api/evaluate/retrieval
POST /api/evaluate/answers

Example query payload:

{
  "question": "What must be removed before sharing screenshots outside the company?",
  "retrieval_mode": "lexical",
  "answer_mode": "extractive",
  "reranking": "keyword",
  "top_k": 3,
  "extensions": [".md"]
}

How The LLM Fits In

The LLM is optional in this project.

The flow is:

the app retrieves grounded context from the local corpus
answer_mode: "extractive" builds a deterministic answer without a model call
answer_mode: "openai-compatible" sends the retrieved context to the configured LLM endpoint

That means the browser UI does not talk to the model directly. The local web app receives the request, runs retrieval, and only then makes the LLM call if you selected openai-compatible.

llama3.2 in the examples is just one sample model name. It shows an optional local setup, such as using Ollama behind an OpenAI-compatible endpoint. You can replace it with any model id your chosen endpoint accepts.

Local Model Config Across CLI, API, And UI

The project uses the same answer-generation settings across every surface:

answer mode
LLM endpoint
LLM model
optional API-key environment variable

The settings stay the same, but each surface passes them differently:

Surface	Where you set local model config
CLI	command flags such as `--answer-mode`, `--llm-endpoint`, and `--llm-model`
JSON API	request fields such as `answer_mode`, `llm_endpoint`, and `llm_model`
browser UI	form fields named `Answer mode`, `LLM endpoint`, and `LLM model`

One practical difference is that the CLI config is set when you start the command, while the API and browser UI send LLM settings with each request.

For a local model server, the usual pattern is:

run a local OpenAI-compatible endpoint such as http://127.0.0.1:11434/v1/chat/completions
choose answer_mode / --answer-mode as openai-compatible
pass the model name your local server expects, such as llama3.2

CLI example:

python -m docs_rag_app.cli \
  --corpus examples/sample_corpus \
  --answer-mode openai-compatible \
  --llm-endpoint http://127.0.0.1:11434/v1/chat/completions \
  --llm-model llama3.2 \
  --question "How should support phrase a reply when timing is uncertain?" \
  --json

In the browser UI:

run python -m docs_rag_app.web --corpus examples/sample_corpus
open http://127.0.0.1:8000
set Answer mode to openai-compatible
fill in LLM endpoint, for example http://127.0.0.1:11434/v1/chat/completions
fill in LLM model
run the query

Equivalent API request:

{
  "question": "How should support phrase a reply when timing is uncertain?",
  "retrieval_mode": "lexical",
  "answer_mode": "openai-compatible",
  "reranking": "keyword",
  "top_k": 3,
  "extensions": [".md"],
  "llm_endpoint": "http://127.0.0.1:11434/v1/chat/completions",
  "llm_model": "llama3.2"
}

If your endpoint requires an API key, set OPENAI_API_KEY before starting the web app. The JSON API also accepts llm_api_key_env if you want the server to read a different environment variable name.

The same answer_mode, llm_endpoint, and llm_model fields also apply to POST /api/evaluate/answers when you want to evaluate the model-backed path.

For the web API, eval_set_path is intentionally limited to JSON files inside the current workspace or the corpus parent directory.

The web UI also includes a Recent Activity panel that shows the latest queries, refreshes, and evaluation runs stored beside the local index.

Evaluation Notes

The project now supports two evaluation loops:

retrieval evaluation with examples/eval_set.json
answer-quality evaluation with examples/answer_eval_set.json

Those evaluation workflows are available from both:

the CLI
the local browser UI and JSON API

The answer-quality evaluation checks whether the app:

cited the expected source
abstained when it should
included required answer phrases
avoided forbidden answer phrases

Indexing Notes

By default the app stores its local index under a hidden directory beside the corpus, for example:

examples/.docs_rag_index/sample_corpus/index.json

You can override that location with --index-path in both the CLI and the web entrypoint.

License

MIT

Books by the Authors

Scan to check out our books on Amazon

Name		Name	Last commit message	Last commit date
Latest commit History 1 Commit
.github/workflows		.github/workflows
assets		assets
docs		docs
examples		examples
src/docs_rag_app		src/docs_rag_app
tests		tests
.gitignore		.gitignore
CHANGELOG.md		CHANGELOG.md
CONTRIBUTING.md		CONTRIBUTING.md
LICENSE		LICENSE
README.md		README.md
SECURITY.md		SECURITY.md
pyproject.toml		pyproject.toml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

docs_rag_app

Project Shape

Quick Start

Current Capabilities

Intended Growth Path

Development

Local API

How The LLM Fits In

Local Model Config Across CLI, API, And UI

Evaluation Notes

Indexing Notes

License

Books by the Authors

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

docs_rag_app

Project Shape

Quick Start

Current Capabilities

Intended Growth Path

Development

Local API

How The LLM Fits In

Local Model Config Across CLI, API, And UI

Evaluation Notes

Indexing Notes

License

Books by the Authors

About

Resources

License

Contributing

Security policy

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages