Skip to content

nextframedev/docs_rag_app

Repository files navigation

docs_rag_app

docs_rag_app is a small companion project for the Practical Tech Guide series book Building RAG Applications.

The goal is to provide a narrow, readable starter project that can grow with the book:

  • load local Markdown and text documents
  • split them into chunks
  • retrieve relevant chunks for a question
  • assemble a grounded answer with citations

The current scaffold starts with lexical retrieval, includes a small embedding-backed vector path, and adds a lightweight reranking step so the project can compare multiple retrieval stages without introducing heavy dependencies too early.

The bundled sample corpus is a small fictional support-team handbook, not book content. It is original example content written for this project, and it gives the app a realistic mix of Markdown and plain-text documents to search over during the examples below.

Project Shape

The starter project is split into a few small modules:

  • loaders.py load and normalize local documents
  • chunking.py split documents into small retrievable chunks
  • retrieval.py rank chunks for a question
  • generation.py turn retrieved chunks into a grounded answer
  • service.py cache the ingested corpus and refresh it when files change
  • index_store.py persist the local index so later queries can reuse it
  • pipeline.py run a simple retrieve-and-answer flow
  • cli.py provide a small local command-line interface
  • web.py provide a small local API and browser query UI

Quick Start

Clone the repo and set up the local environment:

git clone https://github.com/nextframedev/docs_rag_app.git
cd docs_rag_app
python3 -m venv .venv
source .venv/bin/activate
pip install -e '.[dev]'
pytest
python -m docs_rag_app.cli \
  --corpus examples/sample_corpus \
  --question "Who should publish external updates during a sev1 incident?"

Try the starter vector path:

python -m docs_rag_app.cli \
  --corpus examples/sample_corpus \
  --retrieval-mode vector \
  --question "Who should publish external updates during a sev1 incident?"

Try the lightweight reranker:

python -m docs_rag_app.cli \
  --corpus examples/sample_corpus \
  --retrieval-mode lexical \
  --reranking keyword \
  --question "Who should publish external updates during a sev1 incident?" \
  --json

Try the grounded extractive answer mode explicitly:

python -m docs_rag_app.cli \
  --corpus examples/sample_corpus \
  --answer-mode extractive \
  --question "What must be removed before sharing screenshots outside the company?" \
  --json

Try an OpenAI-compatible answer step:

python -m docs_rag_app.cli \
  --corpus examples/sample_corpus \
  --answer-mode openai-compatible \
  --llm-endpoint http://127.0.0.1:11434/v1/chat/completions \
  --llm-model llama3.2 \
  --question "How should support phrase a reply when timing is uncertain?" \
  --json

Compare extractive and OpenAI-compatible answer evaluation side by side:

python -m docs_rag_app.cli \
  --corpus examples/sample_corpus \
  --answer-eval-set examples/answer_eval_set.json \
  --compare-answer-modes \
  --answer-mode openai-compatible \
  --llm-endpoint http://127.0.0.1:11434/v1/chat/completions \
  --llm-model llama3.2 \
  --json

Build or refresh the persisted local index:

python -m docs_rag_app.cli \
  --corpus examples/sample_corpus \
  --build-index

Run the local API and browser UI:

python -m docs_rag_app.web --corpus examples/sample_corpus

Then open http://127.0.0.1:8000.

Keep the default loopback host unless you have a specific reason to expose the local web app on your network. The browser UI and JSON API can accept:

  • evaluation-set paths for local answer and retrieval checks
  • OpenAI-compatible LLM endpoints for model-backed answers

Those are appropriate for local development, but they should be treated more carefully if you bind the app to a non-loopback host.

For structured output:

python -m docs_rag_app.cli \
  --corpus examples/sample_corpus \
  --question "How should a new support engineer request tool access?" \
  --json

To write a small ingestion manifest:

python -m docs_rag_app.cli \
  --corpus examples/sample_corpus \
  --question "What should a shift handoff note include?" \
  --manifest-out manifest.json

To focus on the plain-text handoff notes:

python -m docs_rag_app.cli \
  --corpus examples/sample_corpus \
  --extension .txt \
  --top-k 1 \
  --question "What should a shift handoff note include?" \
  --json

To run a small evaluation set:

python -m docs_rag_app.cli \
  --corpus examples/sample_corpus \
  --eval-set examples/eval_set.json \
  --retrieval-mode lexical \
  --json

To run answer-quality evaluation:

python -m docs_rag_app.cli \
  --corpus examples/sample_corpus \
  --answer-eval-set examples/answer_eval_set.json \
  --answer-mode extractive \
  --json

The bundled evaluation set includes both easy and intentionally harder cases. Some of the harder cases are meant to expose retrieval weaknesses, not to guarantee a perfect score from the current baseline.

The default scaffold is intentionally simple:

  • lexical retrieval uses token overlap
  • vector retrieval uses deterministic hashed embeddings and cosine similarity
  • reranking uses lightweight query expansion plus title/content overlap boosts
  • extractive answers are assembled deterministically from retrieved chunks
  • an OpenAI-compatible answer mode can generate a concise grounded answer from context
  • the OpenAI-compatible path supports configurable HTTP timeouts
  • citations come from source paths and chunk ids
  • compact evidence snippets accompany each answer for UI and API consumers
  • a small manifest summarizes the current corpus ingestion
  • the local service refreshes the corpus when supported files change
  • a local JSON index stores chunks, manifest data, and file signatures
  • the local web app exposes JSON endpoints plus a basic query UI
  • recent query, refresh, and evaluation activity is stored locally for inspection

That keeps the first version transparent enough for teaching.

Current Capabilities

  • load .md and .txt files from a local folder
  • chunk text with configurable size and overlap
  • retrieve the top matching chunks for a question
  • switch between lexical and vector retrieval
  • optionally apply lightweight reranking
  • switch between extractive and OpenAI-compatible answer generation
  • filter retrieval by source substring or file extension
  • return a short grounded answer summary with citations
  • return snippet-level evidence metadata alongside citations
  • return an inspection summary explaining top source, score, and answer mode
  • emit JSON output for downstream inspection
  • write a small ingestion manifest as JSON
  • run a small retrieval evaluation set
  • run a small answer-quality evaluation set
  • refresh the in-memory corpus when files are added or changed
  • persist a local index and reuse it across restarts
  • serve a basic local API for querying and manifest inspection
  • serve a small browser UI for local querying
  • keep a small recent-activity history for queries, refreshes, and evaluations

Intended Growth Path

This project is meant to evolve in stages:

  1. lexical retrieval baseline
  2. embedding and vector retrieval baseline
  3. metadata filters
  4. lightweight reranking
  5. evaluation and debugging workflows
  6. freshness and re-indexing
  7. local API and query UI
  8. richer answer generation
  9. answer-quality evaluation

Development

Run the tests:

pytest

Run linting:

ruff check .

Run type checks:

mypy src

Local API

The starter web app exposes a few small endpoints:

  • GET /api/health
  • GET /api/history
  • GET /api/manifest
  • POST /api/history/clear
  • POST /api/index
  • POST /api/query
  • POST /api/refresh
  • POST /api/evaluate/retrieval
  • POST /api/evaluate/answers

Example query payload:

{
  "question": "What must be removed before sharing screenshots outside the company?",
  "retrieval_mode": "lexical",
  "answer_mode": "extractive",
  "reranking": "keyword",
  "top_k": 3,
  "extensions": [".md"]
}

How The LLM Fits In

The LLM is optional in this project.

The flow is:

  1. the app retrieves grounded context from the local corpus
  2. answer_mode: "extractive" builds a deterministic answer without a model call
  3. answer_mode: "openai-compatible" sends the retrieved context to the configured LLM endpoint

That means the browser UI does not talk to the model directly. The local web app receives the request, runs retrieval, and only then makes the LLM call if you selected openai-compatible.

llama3.2 in the examples is just one sample model name. It shows an optional local setup, such as using Ollama behind an OpenAI-compatible endpoint. You can replace it with any model id your chosen endpoint accepts.

Local Model Config Across CLI, API, And UI

The project uses the same answer-generation settings across every surface:

  • answer mode
  • LLM endpoint
  • LLM model
  • optional API-key environment variable

The settings stay the same, but each surface passes them differently:

Surface Where you set local model config
CLI command flags such as --answer-mode, --llm-endpoint, and --llm-model
JSON API request fields such as answer_mode, llm_endpoint, and llm_model
browser UI form fields named Answer mode, LLM endpoint, and LLM model

One practical difference is that the CLI config is set when you start the command, while the API and browser UI send LLM settings with each request.

For a local model server, the usual pattern is:

  • run a local OpenAI-compatible endpoint such as http://127.0.0.1:11434/v1/chat/completions
  • choose answer_mode / --answer-mode as openai-compatible
  • pass the model name your local server expects, such as llama3.2

CLI example:

python -m docs_rag_app.cli \
  --corpus examples/sample_corpus \
  --answer-mode openai-compatible \
  --llm-endpoint http://127.0.0.1:11434/v1/chat/completions \
  --llm-model llama3.2 \
  --question "How should support phrase a reply when timing is uncertain?" \
  --json

In the browser UI:

  1. run python -m docs_rag_app.web --corpus examples/sample_corpus
  2. open http://127.0.0.1:8000
  3. set Answer mode to openai-compatible
  4. fill in LLM endpoint, for example http://127.0.0.1:11434/v1/chat/completions
  5. fill in LLM model
  6. run the query

Equivalent API request:

{
  "question": "How should support phrase a reply when timing is uncertain?",
  "retrieval_mode": "lexical",
  "answer_mode": "openai-compatible",
  "reranking": "keyword",
  "top_k": 3,
  "extensions": [".md"],
  "llm_endpoint": "http://127.0.0.1:11434/v1/chat/completions",
  "llm_model": "llama3.2"
}

If your endpoint requires an API key, set OPENAI_API_KEY before starting the web app. The JSON API also accepts llm_api_key_env if you want the server to read a different environment variable name.

The same answer_mode, llm_endpoint, and llm_model fields also apply to POST /api/evaluate/answers when you want to evaluate the model-backed path.

For the web API, eval_set_path is intentionally limited to JSON files inside the current workspace or the corpus parent directory.

The web UI also includes a Recent Activity panel that shows the latest queries, refreshes, and evaluation runs stored beside the local index.

Evaluation Notes

The project now supports two evaluation loops:

  • retrieval evaluation with examples/eval_set.json
  • answer-quality evaluation with examples/answer_eval_set.json

Those evaluation workflows are available from both:

  • the CLI
  • the local browser UI and JSON API

The answer-quality evaluation checks whether the app:

  • cited the expected source
  • abstained when it should
  • included required answer phrases
  • avoided forbidden answer phrases

Indexing Notes

By default the app stores its local index under a hidden directory beside the corpus, for example:

examples/.docs_rag_index/sample_corpus/index.json

You can override that location with --index-path in both the CLI and the web entrypoint.

License

MIT

Books by the Authors

QR code to our books on Amazon
Scan to check out our books on Amazon

About

A small, readable RAG starter for local Markdown/text corpora — lexical + vector retrieval, reranking, grounded answers with citations.

Resources

License

Contributing

Security policy

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages