diff --git a/technical/rag.md b/technical/rag.md index 1b167b3..893ae19 100644 --- a/technical/rag.md +++ b/technical/rag.md @@ -34,8 +34,10 @@ at query time. Why the change: - **Control over extraction and chunking.** The ingestion-service runs a [Haystack v2](https://haystack.deepset.ai/) - pipeline with pluggable extraction engines (Tika, Kreuzberg, pypdf, …) and chunking strategies (token, markdown, - sentence, …) selected by configuration. + pipeline with pluggable extraction engines (`tika`, `kreuzberg`, `pypdf`, `docling`, `unstructured`) and chunking + strategies (token, markdown, sentence, …) selected by configuration. A content-based `auto` mode additionally routes + layout-bound documents (flowcharts, diagrams, scanned forms) to a multimodal **vision** path (`vision-llm` / + `hybrid-diagram`) while everything else takes a plain text extractor. - **Hybrid search and reranking.** Ingestion can write a sparse vector alongside the dense one, letting retrieval use Qdrant's native RRF hybrid query plus optional cross-encoder reranking. - **Agentic retrieval.** The retrieval-agent can wrap search in a [PydanticAI](https://ai.pydantic.dev/) loop that @@ -69,7 +71,9 @@ flowchart TB end S3[("S3 / MinIO
raw files")] - SIDE["Tika / Kreuzberg
extraction sidecar"] + SIDE["Tika / Kreuzberg
text extraction sidecars"] + GOT["Gotenberg
office→PDF render sidecar"] + VLM["Vision LLM endpoint
(multimodal, VISION_LLM_*)"] EMB["Embedding endpoint
(embed.itkdev.dk / TEI / fastembed)"] LLM["LiteLLM proxy
(agent + query generation)"] RRK["Reranker endpoint
(embed.itkdev.dk /v1/rerank)"] @@ -82,6 +86,8 @@ flowchart TB ING -->|fetch by key| S3 ING -->|HTTP extract| SIDE + ING -.vision engines.-> GOT + ING -.vision engines.-> VLM ING -->|embed docs| EMB ING -->|write points| QD @@ -91,8 +97,10 @@ flowchart TB RET -->|search points| QD ``` -Solid arrows are always-on; dashed arrows are optional (the LLM is only consulted in agentic mode or when query -generation is enabled; the reranker only when `ENABLE_RERANKING` is enabled). +Solid arrows are always-on; dashed arrows are optional. On the ingestion side, Gotenberg and the Vision LLM endpoint +are only used when a vision engine is selected (`vision-llm`, `hybrid-diagram`, or an `auto`-routed diagram). On the +retrieval side, the LLM is only consulted in agentic mode or when query generation is enabled, and the reranker only +when `ENABLE_RERANKING` is enabled. ## 3. The shared Qdrant contract @@ -112,10 +120,12 @@ The settings that must agree across ingestion-service and retrieval-agent: | Sparse model | `SPARSE_EMBEDDING_MODEL` (when sparse on) | `SPARSE_QUERY_MODEL` (when hybrid on) | Sparse query vectors must match the indexed ones | **Multitenancy.** Every chunk is written with a `meta.collection_name` payload (e.g. `file-abc`, or a knowledge-base -name) and Qdrant is configured for per-tenant subgraphs keyed on that field -(`hnsw_config={"m": 0, "payload_m": 16}`, with the keyword payload index created at ingestion startup). At query time -the retrieval-agent passes `collection_names` and filters on `meta.collection_name ∈ collection_names`, so one Open -WebUI knowledge base never bleeds into another even though they share one physical collection. +name) and Qdrant is configured for per-tenant subgraphs keyed on that field (`hnsw_config={"m": 0, "payload_m": 16}`). 
+The ingestion-service bootstraps the supporting keyword payload indexes at startup: `meta.collection_name` (the tenant +key, `is_tenant=True`), `meta.collection_type`, and `meta.languages` (ISO 639-1 codes surfaced by Kreuzberg, enabling +language filtering). At query time the retrieval-agent passes `collection_names` and filters on +`meta.collection_name ∈ collection_names`, so one Open WebUI knowledge base never bleeds into another even though they +share one physical collection. > The dense embedding model is the contract that breaks most quietly. If ingestion indexed with > `intfloat/multilingual-e5-large` and `"passage: "` prefixes, the retrieval-agent must query with the *same* model and @@ -125,7 +135,10 @@ WebUI knowledge base never bleeds into another even though they share one physic Open WebUI uploads the raw file to S3/MinIO first, then calls `PUT /api/v1/ingest` with the bucket and key (a multipart body is the fallback for direct uploads). The S3 path is preferred because it keeps large files off the FastAPI -worker's heap. +worker's heap. Requests are gated before any work happens: a per-`file_id` lock serializes concurrent ingests of the +same file, `MAX_UPLOAD_BYTES` (100 MB) caps both multipart parts and S3 fetches (checked via `head_object`), the +`S3_ALLOWED_BUCKETS` allow-list restricts which buckets may be read, and the caller's `collection_name` is validated +against the `user_id` / `file_id` it claims. The request handler hands the file to a Haystack v2 pipeline built once at startup and cached as module state: @@ -143,28 +156,86 @@ flowchart LR class SE optional; ``` -1. **Convert** - turn the raw file into Haystack `Document`s. The engine is chosen by `EXTRACTION_ENGINE`: `kreuzberg` - are HTTP sidecars in the parent stack, `pypdf` is in-process, `docling`/`unstructured` need optional - dependencies. Kreuzberg additionally surfaces document metadata (title, authors, languages) and renders tables as - Markdown. -2. 
**Chunk** - slice documents by `CHUNK_SPLIT_BY`. The default `token` mode measures chunk size in the embedding - model's actual tokens (important for e5-large's 512-token cap once the `"passage: "` prefix is added); `markdown` - mode splits on heading hierarchy and records a heading breadcrumb; `word`/`sentence`/`passage` use Haystack's - built-in splitter. Every chunk gets a sequential `meta.split_id`. +1. **Convert** - turn the raw file into Haystack `Document`s. The engine is chosen by `EXTRACTION_ENGINE`: `tika` and + `kreuzberg` are HTTP sidecars in the parent stack, `pypdf` is in-process (PDF-only), `docling`/`unstructured` need + optional dependencies, and `vision-llm`/`hybrid-diagram` are the multimodal vision engines. `auto` is not an engine + but a routing *mode* that picks one per document. Kreuzberg additionally surfaces document metadata (title, authors, + languages) and renders tables as Markdown. See [Vision extraction and content-based + routing](#vision-extraction-and-content-based-routing) below. +2. **Chunk** - slice documents by `CHUNK_SPLIT_BY` (deployed default `markdown`; code default `token`). `token` mode + measures chunk size in the embedding model's actual HuggingFace tokens (important for e5-large's 512-token cap once + the `"passage: "` prefix is added); `markdown` mode splits on heading hierarchy first, records the heading + breadcrumb in `meta.headers`, then token-packs each section over `CHUNK_SIZE`; `word`/`sentence`/`passage` use + Haystack's built-in splitter. `CHUNK_SIZE` (400) and `CHUNK_OVERLAP` (80) bound chunk length, and the tokenizer is + `TOKENIZER_MODEL` (falling back to `EMBEDDING_MODEL`) pinned to `TOKENIZER_REVISION` in deployments. Every chunk gets + a sequential `meta.split_id`. 3. **Embed (dense)** - required. `EMBEDDING_PROVIDER` selects `openai-compat` (the current `embed.itkdev.dk` path), - `fastembed` (in-process), or `tei`. `EMBEDDING_PREFIX_DOC` is prepended to each chunk. -4. **Embed (sparse)** - optional. 
When `ENABLE_SPARSE_EMBEDDINGS=true`, a second named vector (BM42 / SPLADE family) is - added per chunk so retrieval can use Qdrant's native RRF hybrid query instead of client-side BM25. + `fastembed` (in-process), or `tei`. `EMBEDDING_PREFIX_DOC` is prepended to each chunk, and `EMBEDDING_DIM` (1024 for + e5-large) sizes the dense vector Qdrant allocates. +4. **Embed (sparse)** - both deployed stacks run `ENABLE_SPARSE_EMBEDDINGS=true`, so a second named vector + (`text-sparse`, BM42 via `Qdrant/bm42-all-minilm-l6-v2-attentions`) is added per chunk, letting retrieval use + Qdrant's native RRF hybrid query instead of client-side BM25. Dense-only is the code default; when disabled the + writer sees dense-only chunks. 5. **Write** - `DocumentWriter` backed by `QdrantDocumentStore` writes one point per chunk carrying the dense vector - (and the sparse vector when enabled) as named vectors. + (`text-dense`, and `text-sparse` when enabled) as named vectors. **Idempotency.** With `overwrite=true` (the default), all existing points whose `meta.file_id` matches the request are -deleted before the new chunks are written; the same delete runs as teardown if any stage throws. So a `status: true` -response means the file is *fully* indexed, any other outcome means its chunks are absent (no partial writes leak), and -retrying the same `file_id` never duplicates vectors. Open WebUI's reindex action relies on this. +deleted before the new chunks are written; the same delete runs as teardown if any stage throws, and the whole +delete-then-write is guarded by the per-`file_id` lock so two concurrent requests can't interleave. So a +`status: true` response means the file is *fully* indexed, any other outcome means its chunks are absent (no partial +writes leak), and retrying the same `file_id` never duplicates vectors. Open WebUI's reindex action relies on this. 
The +success body is `{status: true, collection_name, chunks_count}`; a failure returns `{status: false, error, code}` where +`code` is one of `EXTRACTION_FAILED`, `EMBEDDING_FAILED`, `SPARSE_EMBEDDING_FAILED`, `QDRANT_WRITE_FAILED`, +`S3_FETCH_FAILED`, `INVALID_REQUEST`, or `PIPELINE_FAILED`. + +### Vision extraction and content-based routing + +Two engines and one routing mode cover documents whose meaning lives in their *layout* rather than their text (flowcharts, diagrams, and +scanned forms that a plain text extractor would flatten or drop): + +- **`vision-llm`** renders the document to page images and reconstructs it with a multimodal LLM. Office formats are + converted to PDF by the **Gotenberg** sidecar; the PDF is rasterized locally (pypdfium2) at `VISION_LLM_DPI` (150), + capped at `VISION_LLM_MAX_PAGES` (20); the pages go to `VISION_LLM_API_BASE_URL` (model `VISION_LLM_MODEL`) in a + single call. A prompt *profile* shapes the output: `diagram` (lanes/phases/steps plus a Mermaid flowchart), + `general` (faithful full-page Markdown), or `ocr` (plain-text transcription), with `diagram-topology` and `figure` + used internally by the hybrid engine. `VISION_LLM_LANGUAGE_HINT` (Danish) keeps the source language verbatim, and + output exceeding `VISION_LLM_MAX_TOKENS` fails the extraction rather than truncating silently. +- **`hybrid-diagram`** (docx only) pairs the *authoritative* native text read straight from the package XML (verbatim + labels, nothing OCR-guessed) with a vision-inferred Mermaid graph. It picks `diagram-topology` for a *vector* + flowchart (labels are Word shapes) or `figure` for a *raster* PNG diagram (labels are pixels); a non-docx input falls + through to plain `vision-llm`. It wraps `vision-llm`, so it needs the same `VISION_LLM_*` and Gotenberg config. This is + the default diagram engine for `auto`. +- **`auto`** is the routing mode: it routes each document by inspecting it (`detectors.py`).
A `.docx` is sent to + `EXTRACTION_ROUTER_DIAGRAM_ENGINE` (default `hybrid-diagram`) when either signal fires; everything else goes to + `EXTRACTION_ROUTER_DEFAULT` (`kreuzberg` in deployments): + - **Vector flowchart** - drawing/text-box shapes ≥ `EXTRACTION_ROUTER_MIN_TEXTBOXES` (20) **and** drawing-to-body + word ratio ≥ `EXTRACTION_ROUTER_DRAWING_RATIO` (2.0). + - **Raster figure** - body images ≥ `EXTRACTION_ROUTER_MIN_BODY_IMAGES` (1) with the largest image's display + area ≥ `EXTRACTION_ROUTER_MIN_IMAGE_EMU` (`1_500_000_000_000` EMU² ≈ 1.79 in²). The area floor, not a count, + rejects decoration; header/footer logos are excluded for free because only `word/document.xml` is read. The opt-in + `EXTRACTION_ROUTER_MIN_IMAGE_WORD_RATIO` (0 = off) adds an image-area-to-body-words guard for corpora with large + decorative hero photos. -There is also a developer-facing `POST /api/v1/extract` probe that runs only the converter and returns the raw extracted -documents - useful for comparing extraction engines without a full ingest. +```mermaid +flowchart TD + A[document] --> B{".docx?"} + B -- no --> D[EXTRACTION_ROUTER_DEFAULT
kreuzberg] + B -- yes --> C{"vector-flowchart
OR raster-figure
signal?"} + C -- no --> D + C -- yes --> E[EXTRACTION_ROUTER_DIAGRAM_ENGINE
hybrid-diagram] + E --> F{"vector vs raster"} + F -- vector --> G[diagram-topology profile] + F -- raster --> H[figure profile] +``` + +**Seeing what was chosen.** The `/api/v1/extract` response echoes the `engine` and `profile` used, every chunk written +to Qdrant carries `meta.extractor` / `meta.vision_profile`, and `DEBUG=true` logs the per-document routing decision +(detector signals → engine/profile) to the container logs. + +There is also a developer-facing `POST /api/v1/extract` probe that runs only the converter and returns the raw +extracted documents - useful for comparing extraction engines without a full ingest. It takes an optional `engine` +override (any concrete engine, but not `auto`) and an optional `profile` (accepted only by the vision engines), and +responds with `{status, engine, profile, documents}`. ## 5. Retrieval path @@ -259,15 +330,34 @@ separate (`RETRIEVAL_AGENT_API_KEY` → `AGENT_API_KEY`). ### ingestion-service (selected) +Defaults below are the values the deployed AarhusAI stacks set; where they differ from the service's own +code/`.env.example` default, the Notes column says so. 
+ | Variable | Default | Notes | | --- | --- | --- | | `API_KEY` | *(required)* | Must equal Open WebUI's `EXTERNAL_INGESTION_API_KEY` | | `QDRANT_INDEX` | `ingestion_files` | Physical collection; must match retrieval-agent | | `EMBEDDING_MODEL` | `intfloat/multilingual-e5-large` | Must match retrieval-agent | +| `EMBEDDING_DIM` | `1024` | Dense vector size; must match the model and retrieval-agent | | `EMBEDDING_PREFIX_DOC` | `"passage: "` | Doc-side prefix (keep the trailing space) | -| `EXTRACTION_ENGINE` | `tika` | `tika` / `pypdf` / `kreuzberg` day-one; `docling` / `unstructured` optional | -| `CHUNK_SPLIT_BY` | `token` | `token` / `markdown` / `word` / `sentence` / `passage` | -| `ENABLE_SPARSE_EMBEDDINGS` | `false` | Adds the sparse vector that enables native hybrid retrieval | +| `EXTRACTION_ENGINE` | `auto` (dev) / `kreuzberg` (server) | Full set `tika`/`pypdf`/`kreuzberg`/`docling`/`unstructured`/`vision-llm`/`hybrid-diagram`/`auto`; code default `tika` | +| `EXTRACTION_ROUTER_DEFAULT` | `kreuzberg` | Non-diagram engine, consulted only when `EXTRACTION_ENGINE=auto` | +| `EXTRACTION_ROUTER_DIAGRAM_ENGINE` | `hybrid-diagram` | Diagram engine, consulted only when `EXTRACTION_ENGINE=auto` | +| `VISION_LLM_API_BASE_URL` | *(empty)* | Multimodal endpoint; empty disables the vision engines | +| `VISION_LLM_MODEL` | *(endpoint-specific)* | Model served by the vision endpoint (`gemma4-nvfp4` code default) | +| `GOTENBERG_URL` | `http://gotenberg:3000` | office→PDF sidecar for the vision engines | +| `CHUNK_SPLIT_BY` | `markdown` | `token`/`markdown`/`word`/`sentence`/`passage`; code default `token` | +| `CHUNK_SIZE` | `400` | Chunk length (HF tokens in token/markdown modes) | +| `CHUNK_OVERLAP` | `80` | Overlap between chunks | +| `TOKENIZER_REVISION` | *(pinned)* | Deploys pin the e5-large tokenizer SHA; empty = Hub HEAD | +| `ENABLE_SPARSE_EMBEDDINGS` | `true` | Adds the sparse vector enabling native hybrid retrieval; code default `false` | +| 
`SPARSE_EMBEDDING_MODEL` | `Qdrant/bm42-all-minilm-l6-v2-attentions` | BM42; must match retrieval-agent when hybrid is on | +| `S3_ALLOWED_BUCKETS` | `openwebui` | Allow-list of buckets the service may fetch from (empty = unenforced) | + +The vision engines (`vision-llm` / `hybrid-diagram`) and the Gotenberg sidecar run on the dev stack +(`EXTRACTION_ENGINE=auto`); the server stack uses plain `kreuzberg` and ships neither. Both stacks enable sparse +embeddings and `markdown` chunking, so the code defaults (`tika`, dense-only, `token`) describe an unconfigured service +rather than either deployment. ### retrieval-agent (selected) @@ -289,11 +379,15 @@ Both services expose the same probe pair: - `GET /health` - liveness; 200 whenever the process is up. - `GET /health/ready` - readiness; 503 until dependencies are reachable. The ingestion-service additionally waits for - its Haystack pipeline to warm up (the sparse embedder pulls its model from HuggingFace on first boot, ~80 MB); the - retrieval-agent verifies Qdrant connectivity. This keeps Docker/Kubernetes from routing traffic during cold start. + its Haystack pipeline to warm up (with sparse enabled — the deployed norm — the BM42 embedder pulls its ~80 MB model + from HuggingFace, cached in the `ingestion_model_cache` volume so it survives restarts); the retrieval-agent verifies + Qdrant connectivity. This keeps Docker/Kubernetes from routing traffic during cold start. Readiness gates only the + pipeline warm-up and Qdrant — Gotenberg and the Vision LLM endpoint are *not* probed, so a vision dependency being + down surfaces per-request as an `EXTRACTION_FAILED` rather than blocking startup. ## 8. Reference - ingestion-service - - retrieval-agent - +- [Gotenberg](https://gotenberg.dev/) - office→PDF render sidecar used by the vision extraction engines - [Patches](./patches.md) - the Open WebUI patches applied in this stack
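
Reviewer note on the `auto` routing rules in "Vision extraction and content-based routing": the decision they describe can be sketched roughly as follows. This is an illustration only, not the service's `detectors.py`. The `DocxSignals` dataclass, its field names, and the `route_engine` function are hypothetical; only the threshold names and their default values come from the diff. The sketch also assumes the vector signal is checked before the raster one when both fire, and it leaves out the opt-in `EXTRACTION_ROUTER_MIN_IMAGE_WORD_RATIO` guard.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Threshold names and defaults as quoted in the routing section.
EXTRACTION_ROUTER_MIN_TEXTBOXES = 20
EXTRACTION_ROUTER_DRAWING_RATIO = 2.0
EXTRACTION_ROUTER_MIN_BODY_IMAGES = 1
EXTRACTION_ROUTER_MIN_IMAGE_EMU = 1_500_000_000_000  # EMU^2, about 1.79 in^2
EXTRACTION_ROUTER_DEFAULT = "kreuzberg"
EXTRACTION_ROUTER_DIAGRAM_ENGINE = "hybrid-diagram"


@dataclass
class DocxSignals:
    """Hypothetical summary of what a detector measures in word/document.xml."""
    textbox_count: int            # drawing/text-box shapes
    drawing_words: int            # words inside drawings
    body_words: int               # words in the normal body text
    body_image_count: int         # images in the body (headers/footers excluded)
    largest_image_area_emu2: int  # display area of the largest body image


def is_vector_flowchart(s: DocxSignals) -> bool:
    # Assumption: an empty body counts as one word to avoid dividing by zero.
    ratio = s.drawing_words / max(s.body_words, 1)
    return (s.textbox_count >= EXTRACTION_ROUTER_MIN_TEXTBOXES
            and ratio >= EXTRACTION_ROUTER_DRAWING_RATIO)


def is_raster_figure(s: DocxSignals) -> bool:
    return (s.body_image_count >= EXTRACTION_ROUTER_MIN_BODY_IMAGES
            and s.largest_image_area_emu2 >= EXTRACTION_ROUTER_MIN_IMAGE_EMU)


def route_engine(filename: str,
                 signals: Optional[DocxSignals] = None) -> Tuple[str, Optional[str]]:
    """Return (engine, vision profile) for one document."""
    # Anything that is not a .docx goes straight to the default text extractor.
    if not filename.lower().endswith(".docx") or signals is None:
        return EXTRACTION_ROUTER_DEFAULT, None
    # Assumption: the vector signal is checked first when both signals fire.
    if is_vector_flowchart(signals):
        return EXTRACTION_ROUTER_DIAGRAM_ENGINE, "diagram-topology"
    if is_raster_figure(signals):
        return EXTRACTION_ROUTER_DIAGRAM_ENGINE, "figure"
    return EXTRACTION_ROUTER_DEFAULT, None
```

Under these assumptions, `route_engine("process.docx", DocxSignals(25, 300, 100, 0, 0))` takes the vector branch, while any non-docx filename goes to the default engine regardless of signals.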