Skip to content
Open
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Original file line number Diff line number Diff line change
Expand Up @@ -21,44 +21,53 @@ The main advantages of using ClickHouse for vector search compared to using more

Here is a quick tutorial on how to use ClickHouse for vector search.

## 1. Create embeddings {#1-create-embeddings}
Your data (documents, images, or structured data) must be converted to _embeddings_. We recommend creating embeddings using the [OpenAI Embeddings API](https://platform.openai.com/docs/api-reference/embeddings) or using the open-source Python library [SentenceTransformers](https://www.sbert.net/).
<Steps>
<Step>
## Create embeddings {#1-create-embeddings}

You can think of an embedding as a large array of floating-point numbers that represent your data. [Check out this guide from OpenAI to learn more about embeddings](https://platform.openai.com/docs/guides/embeddings/what-are-embeddings).
Your data (documents, images, or structured data) must be converted to _embeddings_. We recommend creating embeddings using the [OpenAI Embeddings API](https://platform.openai.com/docs/api-reference/embeddings) or using the open-source Python library [SentenceTransformers](https://www.sbert.net/).

## 2. Store the embeddings {#2-store-the-embeddings}
Once you have generated embeddings, you need to store them in ClickHouse. Each embedding should be stored in a separate row and can include metadata for filtering, aggregations, or analytics. Here's an example of a table that can store images with captions:
You can think of an embedding as a large array of floating-point numbers that represent your data. [Check out this guide from OpenAI to learn more about embeddings](https://platform.openai.com/docs/guides/embeddings/what-are-embeddings).
</Step>
<Step>
## Store the embeddings {#2-store-the-embeddings}

```sql
CREATE TABLE images
(
`_file` LowCardinality(String),
`caption` String,
`image_embedding` Array(Float32)
)
ENGINE = MergeTree;
```
Once you have generated embeddings, you need to store them in ClickHouse. Each embedding should be stored in a separate row and can include metadata for filtering, aggregations, or analytics. Here's an example of a table that can store images with captions:

## 3. Search for related embeddings {#3-search-for-related-embeddings}
Let's say you want to search for pictures of dogs in your dataset. You can use a distance function like `cosineDistance` to take an embedding of a dog image and search for related images:
```sql
CREATE TABLE images
(
`_file` LowCardinality(String),
`caption` String,
`image_embedding` Array(Float32)
)
ENGINE = MergeTree;
```
</Step>
<Step>
## Search for related embeddings {#3-search-for-related-embeddings}

```sql
SELECT
_file,
caption,
cosineDistance(
-- An embedding of your "input" dog picture
[0.5736801028251648, 0.2516217529773712, ..., -0.6825592517852783],
image_embedding
) AS score
FROM images
ORDER BY score ASC
LIMIT 10
```
Let's say you want to search for pictures of dogs in your dataset. You can use a distance function like `cosineDistance` to take an embedding of a dog image and search for related images:

This query returns the `_file` names and `caption` of the top 10 images most likely to be related to your provided dog image.
```sql
SELECT
_file,
caption,
cosineDistance(
-- An embedding of your "input" dog picture
[0.5736801028251648, 0.2516217529773712, ..., -0.6825592517852783],
image_embedding
) AS score
FROM images
ORDER BY score ASC
LIMIT 10
```

## Further Reading {#further-reading}
This query returns the `_file` names and `caption` of the top 10 images most likely to be related to your provided dog image.
</Step>
</Steps>

## Further reading {#further-reading}
To follow a more in-depth tutorial on vector search using ClickHouse, please see:

- [Exact and Approximate Vector Search](/reference/engines/table-engines/mergetree-family/annindexes)