Merged
77 changes: 48 additions & 29 deletions docs/usage.md
@@ -1,5 +1,9 @@
# Knowhere Python SDK — Usage Guide

> **Recent changes:** Chunk metadata fields (`tokens`, `keywords`, `summary`,
> `length`, etc.) are no longer flattened to the chunk surface. Access them
> through `chunk.metadata` instead. See [Chunk Types](#chunk-types).

Comprehensive reference for every feature, parameter, and pattern in the SDK.

## Table of Contents
@@ -219,8 +223,13 @@ result.table_chunks # List[TableChunk]
# Lookup by ID
chunk = result.getChunk("chunk_42")

# Hierarchy data (document structure tree, if available)
result.hierarchy
# Document navigation tree (from doc_nav.json, current worker output)
result.doc_nav # DocNav | None
result.doc_nav.sections # List[DocNavSection] — tree of titles/paths/levels
result.doc_nav.resources # DocNavResources — image/table resource summaries

# Legacy hierarchy (from hierarchy.json, older worker output)
result.hierarchy # Any | None

# Raw ZIP bytes (for archival)
result.raw_zip
@@ -239,63 +248,56 @@ result.save("./output/report/")

## Chunk Types

Every chunk shares a base set of fields (`chunk_id`, `type`, `content`, `path`). Each type adds its own fields.
Every chunk shares a base set of fields (`chunk_id`, `type`, `content`, `path`,
`metadata`). Worker metadata is kept in the `metadata` dict — it is **not**
flattened to top-level chunk properties.
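
Code written against the older flattened fields may benefit from a small accessor during migration. The following is a minimal sketch, not part of the SDK: `chunk_field` and the `StubChunk` stand-in are hypothetical names used only for illustration, and the sketch assumes nothing beyond a chunk exposing a `metadata` dict as described above.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class StubChunk:
    """Hypothetical stand-in for an SDK chunk; not the real class."""

    chunk_id: str
    content: str
    metadata: Dict[str, Any] = field(default_factory=dict)


def chunk_field(chunk: Any, name: str, default: Any = None) -> Any:
    """Read a worker metadata field, preferring chunk.metadata.

    Falls back to a same-named attribute so the helper also works with
    objects that still expose the older flattened fields.
    """
    metadata = getattr(chunk, "metadata", None) or {}
    if name in metadata:
        return metadata[name]
    return getattr(chunk, name, default)


chunk = StubChunk("chunk_42", "Hello", {"keywords": ["greeting"], "length": 5})
print(chunk_field(chunk, "keywords", []))
print(chunk_field(chunk, "summary"))
```

Looking in `metadata` first keeps the new layout authoritative; the attribute fallback only matters for objects built by an older SDK version.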

### TextChunk
### Base fields (all chunk types)

| Field | Type | Description |
|-------|------|-------------|
| `chunk_id` | `str` | Unique identifier |
| `type` | `str` | Always `"text"` |
| `content` | `str` | The text content |
| `path` | `str \| None` | Document structure path (e.g. `"Section 1 > Subsection 2"`) |
| `length` | `int` | Character count |
| `tokens` | `List[str] \| None` | Tokenized words returned by the parser pipeline |
| `keywords` | `List[str] \| None` | Extracted keywords (requires `summary_txt: True`) |
| `summary` | `str \| None` | AI-generated summary (requires `summary_txt: True`) |
| `relationships` | `List \| None` | Relationships to other chunks |
| `type` | `str` | `"text"`, `"image"`, or `"table"` |
| `content` | `str` | Text content or placeholder |
| `path` | `str \| None` | Document structure path |
| `metadata` | `dict` | Raw worker metadata (tokens, keywords, summary, length, page_nums, etc.) |

### TextChunk

```python
for chunk in result.text_chunks:
    print(f"[{chunk.chunk_id}] {chunk.content[:60]}...")
    if chunk.keywords:
        print(f" Keywords: {', '.join(chunk.keywords)}")
    if chunk.summary:
        print(f" Summary: {chunk.summary}")
    # Metadata is in chunk.metadata, not flattened:
    keywords = chunk.metadata.get("keywords", [])
    summary = chunk.metadata.get("summary")
    if keywords:
        print(f" Keywords: {', '.join(keywords)}")
    if summary:
        print(f" Summary: {summary}")
```

### ImageChunk

| Field | Type | Description |
|-------|------|-------------|
| `chunk_id` | `str` | Unique identifier |
| `type` | `str` | Always `"image"` |
| `content` | `str` | Text content associated with the image |
| `file_path` | `str \| None` | Path within the ZIP |
| `original_name` | `str \| None` | Original filename |
| `summary` | `str \| None` | AI-generated image description (requires `summary_image: True`) |
| `data` | `bytes` | Raw image bytes (loaded from ZIP) |
| `format` | `str \| None` | Image format inferred from extension (property) |

```python
for img in result.image_chunks:
    print(f"{img.file_path} ({len(img.data)} bytes, {img.format})")
    if img.summary:
        print(f" Description: {img.summary}")
    summary = img.metadata.get("summary")
    if summary:
        print(f" Description: {summary}")
    img.save("./output/images/")  # writes to disk
```

### TableChunk

| Field | Type | Description |
|-------|------|-------------|
| `chunk_id` | `str` | Unique identifier |
| `type` | `str` | Always `"table"` |
| `content` | `str` | Text representation of the table |
| `file_path` | `str \| None` | Path within the ZIP |
| `original_name` | `str \| None` | Original filename |
| `table_type` | `str \| None` | Table classification |
| `summary` | `str \| None` | AI-generated table summary (requires `summary_table: True`) |
| `html` | `str` | Full HTML of the table (loaded from ZIP) |

```python
@@ -471,6 +473,19 @@ response = client.retrieval.query(
    top_k=5,
)

# Agentic mode (LLM navigation + answer synthesis)
response = client.retrieval.query(
    namespace="support-center",
    query="How do I pair a Bluetooth headset?",
    use_agentic=True,
    top_k=5,
)
print(response.answer_text)  # LLM-generated natural-language answer
print(response.router_used)  # "workflow_single_step", "small_kb_all", etc.
for ref in response.referenced_chunks:
    print(ref.get("chunk_id"), ref.get("asset_url"))

# Legacy results are always available
for result in response.results:
    print(result.content)
    print(result.score)
@@ -479,6 +494,10 @@ for result in response.results:
    print(result.source.section_path)
```

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `use_agentic` | `bool \| None` | `None` | Force agentic (`True`) or legacy (`False`) retrieval. `None` uses server default. |
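
Because `use_agentic` has three states, an explicit `False` differs from leaving it unset. The sketch below illustrates that distinction in isolation. `build_query_body` is a hypothetical helper written for this example, not an SDK function, and the `"query"` key is an assumption about the request shape.

```python
from typing import Any, Dict, Optional


def build_query_body(
    query: str,
    namespace: Optional[str] = None,
    top_k: Optional[int] = None,
    use_agentic: Optional[bool] = None,
) -> Dict[str, Any]:
    """Build a request body, omitting every option that was left as None."""
    body: Dict[str, Any] = {"query": query}  # key name is an assumption
    if namespace is not None:
        body["namespace"] = namespace
    if top_k is not None:
        body["top_k"] = top_k
    # `is not None` (rather than truthiness) ensures an explicit False is sent.
    if use_agentic is not None:
        body["use_agentic"] = use_agentic
    return body


print(build_query_body("pair headset"))
print(build_query_body("pair headset", use_agentic=False))
```

Omitting the key when the value is `None` leaves the choice to the server default, while `False` explicitly requests legacy retrieval.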

Retrieval results expose `content`, not the older parse-result `text` field.
Media results may include `asset_url` when the server can sign the referenced
artifact.
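
Since `asset_url` is optional, callers may want to filter for it before downloading anything. This is a minimal sketch under that assumption: `collect_asset_urls` is a hypothetical helper, and the sample dictionaries and URL are made up for illustration rather than taken from a real response.

```python
from typing import Any, Dict, Iterable, List


def collect_asset_urls(referenced_chunks: Iterable[Dict[str, Any]]) -> List[str]:
    """Return the asset URLs that are present, skipping chunks without one."""
    urls: List[str] = []
    for ref in referenced_chunks:
        url = ref.get("asset_url")
        if url:  # skips both a missing key and an explicit None
            urls.append(url)
    return urls


# Illustrative data only; real entries come from response.referenced_chunks.
refs = [
    {"chunk_id": "c1", "asset_url": "https://example.com/a.png"},
    {"chunk_id": "c2"},
    {"chunk_id": "c3", "asset_url": None},
]
print(collect_asset_urls(refs))
```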
77 changes: 18 additions & 59 deletions src/knowhere/lib/result_parser.py
@@ -13,13 +13,13 @@
from knowhere._logging import getLogger
from knowhere.types.result import (
    Chunk,
    DocNav,
    ImageChunk,
    Manifest,
    ParseResult,
    SlimChunk,
    TableChunk,
    TextChunk,
    TextChunkTokens,
)

_logger = getLogger()
@@ -81,38 +81,6 @@ def _extractFilePath(raw: Dict[str, Any]) -> Optional[str]:
    return fallback


def _normalizeTokenList(raw_tokens: List[Any]) -> List[str]:
    """Return a string-only token list with empty values removed."""
    normalized_tokens: List[str] = []
    for raw_token in raw_tokens:
        token_text: str = str(raw_token).strip()
        if token_text:
            normalized_tokens.append(token_text)
    return normalized_tokens


def _parseTextChunkTokens(
    raw_tokens: Any,
    *,
    chunk_id: str,
) -> Optional[TextChunkTokens]:
    """Normalize text chunk tokens from the current backend payload."""
    if raw_tokens is None:
        return None
    if isinstance(raw_tokens, bool):
        raise KnowhereError(
            f"Invalid tokens payload for text chunk '{chunk_id}': expected list[str], got bool."
        )
    if isinstance(raw_tokens, list):
        return _normalizeTokenList(raw_tokens)

    raise KnowhereError(
        "Invalid tokens payload for text chunk "
        f"'{chunk_id}': expected list[str], "
        f"got {type(raw_tokens).__name__}."
    )


def _buildChunks(
    raw_chunks: List[Dict[str, Any]],
    zf: zipfile.ZipFile,
@@ -125,58 +93,39 @@ def _buildChunks(

        if chunk_type == "image":
            image_data: bytes = b""
            # file_path may be at top level, inside metadata, or use path as fallback
            file_path: Optional[str] = _extractFilePath(raw)
            if file_path:
                image_data = _readZipBytes(zf, file_path) or b""
            metadata: Dict[str, Any] = raw.get("metadata", {})
            chunk: Chunk = ImageChunk(
                chunk_id=raw.get("chunk_id", ""),
                type="image",
                content=raw.get("content", ""),
                path=raw.get("path"),
                page_nums=metadata.get("page_nums", raw.get("page_nums")),
                length=metadata.get("length", raw.get("length", 0)),
                file_path=file_path,
                original_name=metadata.get("original_name", raw.get("original_name")),
                summary=metadata.get("summary", raw.get("summary")),
                data=image_data,
                metadata=raw.get("metadata", {}),
            )
        elif chunk_type == "table":
            table_html: str = ""
            file_path = _extractFilePath(raw)
            if file_path:
                table_html = _readZipText(zf, file_path) or ""
            metadata = raw.get("metadata", {})
            chunk = TableChunk(
                chunk_id=raw.get("chunk_id", ""),
                type="table",
                content=raw.get("content", ""),
                path=raw.get("path"),
                page_nums=metadata.get("page_nums", raw.get("page_nums")),
                length=metadata.get("length", raw.get("length", 0)),
                file_path=file_path,
                original_name=metadata.get("original_name", raw.get("original_name")),
                table_type=metadata.get("table_type", raw.get("table_type")),
                summary=metadata.get("summary", raw.get("summary")),
                html=table_html,
                metadata=raw.get("metadata", {}),
            )
        else:
            metadata = raw.get("metadata", {})
            chunk_id: str = raw.get("chunk_id", "")
            raw_tokens: Any = metadata.get("tokens", raw.get("tokens"))
            chunk = TextChunk(
                chunk_id=chunk_id,
                chunk_id=raw.get("chunk_id", ""),
                type="text",
                content=raw.get("content", ""),
                path=raw.get("path"),
                page_nums=metadata.get("page_nums", raw.get("page_nums")),
                length=metadata.get("length", raw.get("length", 0)),
                tokens=_parseTextChunkTokens(raw_tokens, chunk_id=chunk_id),
                keywords=metadata.get("keywords", raw.get("keywords")),
                summary=metadata.get("summary", raw.get("summary")),
                connect_to=metadata.get("connect_to", raw.get("connect_to")),
                relationships=metadata.get("relationships", raw.get("relationships")),
                metadata=raw.get("metadata", {}),
            )

        chunks.append(chunk)
@@ -229,7 +178,15 @@ def parseResultZip(
# -- Full markdown --
full_markdown: str = _readZipText(zf, "full.md") or ""

# -- Hierarchy --
# -- DocNav (current worker output) --
doc_nav_text: Optional[str] = _readZipText(zf, "doc_nav.json")
doc_nav: Optional[DocNav] = (
DocNav.model_validate(json.loads(doc_nav_text))
if doc_nav_text
else None
)

# -- Hierarchy (legacy — current worker no longer emits this) --
hierarchy_text: Optional[str] = _readZipText(zf, "hierarchy.json")
hierarchy: Optional[Any] = (
json.loads(hierarchy_text) if hierarchy_text else None
@@ -263,11 +220,13 @@ def parseResultZip(
return ParseResult(
manifest=manifest,
chunks=chunks,
chunks_slim=chunks_slim,
full_markdown=full_markdown,
raw_zip=zip_bytes,
doc_nav=doc_nav,
# Legacy — the current worker no longer emits these files
chunks_slim=chunks_slim,
hierarchy=hierarchy,
toc_hierarchies=toc_hierarchies,
kb_csv=kb_csv,
hierarchy_view_html=hierarchy_view_html,
raw_zip=zip_bytes,
)
6 changes: 6 additions & 0 deletions src/knowhere/resources/retrieval.py
@@ -22,6 +22,7 @@ def query(
query: str,
namespace: Optional[str] = None,
top_k: Optional[int] = None,
use_agentic: Optional[bool] = None,
data_type: Optional[int] = None,
signal_paths: Optional[list[str]] = None,
filter_mode: Optional[RetrievalFilterMode] = None,
@@ -39,6 +40,8 @@ def query(
body["namespace"] = namespace
if top_k is not None:
body["top_k"] = top_k
if use_agentic is not None:
body["use_agentic"] = use_agentic
if data_type is not None:
body["data_type"] = data_type
if signal_paths is not None:
@@ -77,6 +80,7 @@ async def query(
query: str,
namespace: Optional[str] = None,
top_k: Optional[int] = None,
use_agentic: Optional[bool] = None,
data_type: Optional[int] = None,
signal_paths: Optional[list[str]] = None,
filter_mode: Optional[RetrievalFilterMode] = None,
@@ -94,6 +98,8 @@ async def query(
body["namespace"] = namespace
if top_k is not None:
body["top_k"] = top_k
if use_agentic is not None:
body["use_agentic"] = use_agentic
if data_type is not None:
body["data_type"] = data_type
if signal_paths is not None: