Merged
71 changes: 71 additions & 0 deletions README.md
@@ -32,6 +32,74 @@ for chunk in result.text_chunks:
    print(chunk.content[:80])
```

## Retrieval and document lifecycle

New documents are published into a retrieval namespace. The server returns a
stable `document_id` when you create a job; persist that value if you need to
update or archive the same document later.

```python
job = client.jobs.create(
    source_type="url",
    source_url="https://example.com/manual.pdf",
    namespace="support-center",
)

print(job.document_id) # "doc_..."
```

After the job is done and published, query the canonical document content:

```python
response = client.retrieval.query(
    namespace="support-center",
    query="How do I reset Bluetooth pairing?",
    top_k=5,
)

for result in response.results:
    print(result.content)
    print(result.score)
    print(result.source.source_file_name, result.source.section_path)
```

Use `document_id` to update or archive a document:

```python
update_job = client.jobs.create(
    source_type="url",
    source_url="https://example.com/manual-v2.pdf",
    document_id=job.document_id,
)

document = client.documents.get(job.document_id)
print(document.status)

client.documents.archive(job.document_id)
```

You can also list documents in a namespace:

```python
documents = client.documents.list(namespace="support-center")
for document in documents.documents:
    print(document.document_id, document.status)
```

Retrieval supports exclusions when clients want follow-up results that avoid
previously used documents or sections:

```python
response = client.retrieval.query(
    namespace="support-center",
    query="battery charging",
    exclude_document_ids=["doc_old"],
    exclude_sections=[
        {"document_id": "doc_123", "section_path": "Appendix / Legal"}
    ],
)
```

While you can provide an `api_key` keyword argument, we recommend using [python-dotenv](https://pypi.org/project/python-dotenv/) to add `KNOWHERE_API_KEY="sk_..."` to your `.env` file so that your API key is not stored in source control.

### Parse a local file
@@ -105,9 +173,12 @@ from pathlib import Path
job = client.jobs.create(
    source_type="file",
    file_name="report.pdf",
    namespace="support-center",
    parsing_params={"model": "advanced", "ocr_enabled": True},
)

print(job.document_id) # Persist this to update/archive the document later.

# Step 2: Upload file to presigned URL
client.jobs.upload(job, file=Path("report.pdf"))

127 changes: 127 additions & 0 deletions docs/usage.md
@@ -12,6 +12,7 @@ Comprehensive reference for every feature, parameter, and pattern in the SDK.
- [Working with Results](#working-with-results)
- [Chunk Types](#chunk-types)
- [Step-by-Step Control (Jobs API)](#step-by-step-control-jobs-api)
- [Retrieval and Document Lifecycle](#retrieval-and-document-lifecycle)
- [Async Usage](#async-usage)
- [Progress Callbacks](#progress-callbacks)
- [Error Handling](#error-handling)
@@ -316,8 +317,10 @@ from pathlib import Path
job = client.jobs.create(
    source_type="file",
    file_name="report.pdf",
    namespace="support-center",
    parsing_params={"model": "advanced", "ocr_enabled": True},
)
print(job.document_id) # Persist this value for update/archive flows.

# Step 2: Upload file to the presigned URL
client.jobs.upload(job, file=Path("report.pdf"))
@@ -341,6 +344,8 @@ print(result.statistics)
| `source_type` | `"url" \| "file"` | — | Required. Whether parsing from URL or uploaded file. |
| `source_url` | `str \| None` | `None` | URL to parse (required when `source_type="url"`). |
| `file_name` | `str \| None` | `None` | Original filename (used when `source_type="file"`). |
| `namespace` | `str \| None` | `None` | Retrieval namespace. The server defaults to `"default"` when omitted. |
| `document_id` | `str \| None` | `None` | Existing document ID when creating an update job. Omit for a new document. |
| `data_id` | `str \| None` | `None` | Your own correlation/idempotency identifier. |
| `parsing_params` | `ParsingParams \| None` | `None` | Parsing configuration. |
| `webhook` | `WebhookConfig \| None` | `None` | Webhook for completion notification. |
@@ -351,6 +356,8 @@ Returns a `Job` object:
job.job_id # "abc-123"
job.status # "pending"
job.source_type # "file"
job.namespace # "support-center"
job.document_id # "doc_..." — persist this for updates and archive calls
job.upload_url # presigned URL (for file uploads)
job.upload_headers # headers to include in the upload request
job.expires_in # seconds until upload URL expires
@@ -407,6 +414,119 @@ result = client.jobs.load("https://storage.example.com/result.zip")

---

## Retrieval and Document Lifecycle

The retrieval APIs operate on canonical documents that are published after a
job completes. For new documents, the server generates `document_id` during
`jobs.create()`. Store that ID in your application if you need to update or
archive the same document later.
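
A minimal sketch of persisting the ID, assuming a small JSON file is enough for your application. `DocumentIdStore`, its methods, and the file layout are hypothetical names introduced for illustration and are not part of the SDK; a production service would more likely use its own database.

```python
import json
from pathlib import Path
from typing import Dict, Optional


class DocumentIdStore:
    """Hypothetical helper (not part of the SDK): maps your own keys to document IDs."""

    def __init__(self, path: Path) -> None:
        self._path = path

    def _read(self) -> Dict[str, str]:
        # A missing file simply means nothing has been stored yet.
        if not self._path.exists():
            return {}
        return json.loads(self._path.read_text(encoding="utf-8"))

    def save(self, key: str, document_id: str) -> None:
        # Read-modify-write keeps entries stored under other keys intact.
        mapping = self._read()
        mapping[key] = document_id
        self._path.write_text(json.dumps(mapping, indent=2), encoding="utf-8")

    def get(self, key: str) -> Optional[str]:
        return self._read().get(key)
```

With a store like this, you might call `store.save("manual", job.document_id)` right after `jobs.create()` and later pass `document_id=store.get("manual")` when creating the update job.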

### Create a retrievable document

```python
job = client.jobs.create(
    source_type="url",
    source_url="https://example.com/manual.pdf",
    namespace="support-center",
)

print(job.document_id) # "doc_..."
```

For file uploads, the flow is the same except that you upload the file before
polling:

```python
from pathlib import Path

job = client.jobs.create(
    source_type="file",
    file_name="manual.pdf",
    namespace="support-center",
)
client.jobs.upload(job, file=Path("manual.pdf"))
job_result = client.jobs.wait(job.job_id)
```

### Update an existing document

Pass the prior `document_id` to create an update job. If `namespace` is omitted,
the API resolves the namespace from the existing document.

```python
update_job = client.jobs.create(
    source_type="url",
    source_url="https://example.com/manual-v2.pdf",
    document_id=job.document_id,
)
```

If a non-terminal job already exists for the same document, the API rejects
the new job with a retryable `ConflictError` (server error code `ABORTED`).
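
Because the conflict is retryable, a caller can wait and try again. The sketch below is one possible approach under stated assumptions: the `ConflictError` class is a local stand-in so the example runs on its own (in real code you would catch the SDK's exception), and `retry_on_conflict`, the attempt limit, and the backoff delays are hypothetical choices rather than documented SDK behavior.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class ConflictError(Exception):
    """Local stand-in so the sketch runs alone; catch the SDK's ConflictError in real code."""


def retry_on_conflict(
    create_job: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call create_job, retrying with exponential backoff and jitter on ConflictError."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return create_job()
        except ConflictError:
            if attempt == max_attempts:
                raise  # Out of attempts: surface the conflict to the caller.
            # Delay doubles each attempt (base, 2x, 4x, ...) plus up to 10% jitter.
            delay = base_delay * 2 ** (attempt - 1)
            sleep(delay + random.uniform(0, delay * 0.1))
    raise AssertionError("unreachable")
```

A call might look like `retry_on_conflict(lambda: client.jobs.create(source_type="url", source_url="https://example.com/manual-v2.pdf", document_id=job.document_id))`. The `sleep` parameter exists only so tests can replace the real wait.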

### Query retrieval results

```python
response = client.retrieval.query(
    namespace="support-center",
    query="How do I pair a Bluetooth headset?",
    top_k=5,
)

for result in response.results:
    print(result.content)
    print(result.score)
    print(result.source.document_id)
    print(result.source.source_file_name)
    print(result.source.section_path)
```

Retrieval results expose `content`, not the older parse-result `text` field.
Media results may include `asset_url` when the server can sign the referenced
artifact.

Each retrieval result uses one canonical source reference shape:

```python
result.content
result.chunk_type
result.score
result.asset_url # Optional[str]
result.source.document_id
result.source.source_file_name
result.source.section_path
```
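
As an illustration of consuming that shape, the sketch below renders one result as a citation line and includes the asset URL only when it is present. The `Source` and `Result` dataclasses are hypothetical stand-ins that mirror the fields listed above so the example runs without the SDK; they are not the SDK's own classes, and the field types (for example `section_path` as a string) are assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Source:
    """Stand-in for result.source (assumed field types)."""
    document_id: str
    source_file_name: str
    section_path: str


@dataclass
class Result:
    """Stand-in for one retrieval result (assumed field types)."""
    content: str
    chunk_type: str
    score: float
    source: Source
    asset_url: Optional[str] = None


def format_citation(result: Result) -> str:
    """Render one result as a single line; append the asset URL only if set."""
    line = (
        f"[{result.score:.2f}] {result.source.source_file_name}"
        f" > {result.source.section_path} ({result.source.document_id})"
    )
    if result.asset_url is not None:
        line += f" asset: {result.asset_url}"
    return line
```

Checking `asset_url` against `None` before use matters because the field is optional and may be absent for results the server cannot sign.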

### Exclude documents or sections

Use exclusions for follow-up queries that should avoid already-used context.

```python
response = client.retrieval.query(
    namespace="support-center",
    query="battery charging",
    top_k=10,
    exclude_document_ids=["doc_old"],
    exclude_sections=[
        {"document_id": "doc_123", "section_path": "Appendix / Legal"}
    ],
)
```
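
For multi-turn flows, the exclusion list usually has to be built from earlier results. The helper below is a sketch of that bookkeeping, not an SDK function: `build_exclude_sections` is a hypothetical name, and it assumes you have already collected `(document_id, section_path)` pairs from the `source` of each result you used.

```python
from typing import Dict, Iterable, List, Set, Tuple


def build_exclude_sections(
    used: Iterable[Tuple[str, str]],
) -> List[Dict[str, str]]:
    """Convert (document_id, section_path) pairs into exclude_sections entries.

    Duplicates are dropped and first-seen order is kept.
    """
    seen: Set[Tuple[str, str]] = set()
    exclusions: List[Dict[str, str]] = []
    for document_id, section_path in used:
        key = (document_id, section_path)
        if key in seen:
            continue
        seen.add(key)
        exclusions.append(
            {"document_id": document_id, "section_path": section_path}
        )
    return exclusions
```

The pairs could come from something like `[(r.source.document_id, r.source.section_path) for r in response.results]`, with the returned list passed as `exclude_sections` on the follow-up query.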

### List, get, and archive documents

```python
document_list = client.documents.list(namespace="support-center")
for document in document_list.documents:
    print(document.document_id, document.status, document.source_file_name)

document = client.documents.get("doc_123")
print(document.current_job_result_id)

archived = client.documents.archive("doc_123")
print(archived.status) # "archived"
```

---

## Async Usage

Every method available on `Knowhere` has an async counterpart on `AsyncKnowhere`:
@@ -429,6 +549,13 @@ async def main():
    job_result = await client.jobs.wait(job.job_id)
    result = await client.jobs.load(job_result)

    retrieval = await client.retrieval.query(
        namespace="support-center",
        query="refund policy",
        top_k=5,
    )
    print(retrieval.results[0].content)

asyncio.run(main())
```
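
One reason to use the async client is to issue several queries at once. The sketch below shows the general pattern with `asyncio.gather`, under the assumption that the async client tolerates concurrent calls, which is not stated here. `query_all` and `fake_query` are hypothetical names; `fake_query` stands in for `await client.retrieval.query(...)` so the example runs without a server.

```python
import asyncio
from typing import Awaitable, Callable, List


async def query_all(
    run_query: Callable[[str], Awaitable[str]],
    queries: List[str],
) -> List[str]:
    """Run one coroutine per query concurrently; results keep input order."""
    return await asyncio.gather(*(run_query(q) for q in queries))


async def fake_query(query: str) -> str:
    # Stand-in for a real retrieval call: sleeps briefly instead of hitting a server.
    await asyncio.sleep(0.01)
    return f"top result for {query!r}"
```

`asyncio.gather` returns results in the order the awaitables were passed, so each answer can be matched to its query by position.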

13 changes: 13 additions & 0 deletions src/knowhere/__init__.py
@@ -35,8 +35,14 @@
)
from knowhere._types import PollProgressCallback, UploadProgressCallback
from knowhere._version import __version__
from knowhere.types.document import Document, DocumentListResponse
from knowhere.types.job import Job, JobError, JobProgress, JobResult
from knowhere.types.params import ParsingParams, WebhookConfig
from knowhere.types.retrieval import (
    RetrievalSource,
    RetrievalQueryResponse,
    RetrievalResult,
)
from knowhere.types.result import (
BaseChunk,
Checksum,
@@ -87,6 +93,13 @@
    "JobError",
    "JobProgress",
    "JobResult",
    # Document types
    "Document",
    "DocumentListResponse",
    # Retrieval types
    "RetrievalSource",
    "RetrievalQueryResponse",
    "RetrievalResult",
    # Result types
    "ParseResult",
    "Manifest",