[BUG] FastEmbed embeddings are not L2-normalized → semantic search silently degrades to FTS-only for non-bge models (e.g. multilingual-mpnet) #1023

Description

@SloNN

Summary

FastEmbedEmbeddingProvider does not L2-normalize embedding vectors. The SQLite
distance→similarity formula introduced in #593 (cos_sim = 1 - L2²/2) is only valid
for unit vectors. The default model bge-small-en-v1.5 is normalized by FastEmbed,
so this never surfaced. But models that FastEmbed does not normalize — notably
sentence-transformers/paraphrase-multilingual-mpnet-base-v2 (the natural multilingual
choice) — produce vectors with norm ≈ 2.5, so the formula yields negative similarity,
clamps to ~0, and the default semantic_min_similarity=0.55 filters out the entire
semantic channel. Hybrid search silently becomes FTS-only.

Impact

On a real vault (multilingual-mpnet, 499 notes, v0.21.5), 59% of 242 real recall
queries (mined from agent transcripts) returned zero results. Conceptual and
inflected (Russian) queries failed because FTS has no lemmatization and the semantic
fallback was dead. After fixing normalization + reindex: empty rate 59% → 2%
(145/149 previously-empty queries recovered; the 4 remaining are genuinely-absent content).

Root cause

basic_memory/repository/fastembed_provider.py::embed_documents._embed_batch:

vectors = list(model.embed(texts, **embed_kwargs))
normalized: list[list[float]] = []          # named "normalized"...
for vector in vectors:
    values = vector.tolist() if hasattr(vector, "tolist") else vector
    normalized.append([float(value) for value in values])   # ...but never normalized
return normalized

The vector is never L2-normalized. Combined with the SQLite 1 - L2²/2 similarity
conversion (valid only for unit vectors), search is broken for any model that
doesn't self-normalize.
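
For reference, the identity behind that conversion can be derived as below. This is only a sketch: it assumes, for simplicity, that both vectors share the same norm r (a symbol introduced here, not in the code), with θ the angle between them.

```latex
\begin{aligned}
\|a - b\|^2 &= \|a\|^2 + \|b\|^2 - 2\,\|a\|\,\|b\|\cos\theta \\
\|a\| = \|b\| = 1 &\;\Rightarrow\; \|a - b\|^2 = 2 - 2\cos\theta
  \;\Rightarrow\; \cos\theta = 1 - \tfrac{\|a - b\|^2}{2} \\
\|a\| = \|b\| = r &\;\Rightarrow\; 1 - \tfrac{\|a - b\|^2}{2} = 1 - r^2\,(1 - \cos\theta)
\end{aligned}
```

So for r ≠ 1 the expression no longer equals cos θ. With r ≈ 2.5 and cos θ = 0.61 it evaluates to 1 − 6.25 × 0.39 ≈ −1.44.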

Evidence (measured)

| metric | value |
| --- | --- |
| stored vector norm (mpnet) | ~2.5 (should be 1.0) |
| honest cosine, relevant chunk pair | 0.61 |
| similarity BM reports | 0.00 |
| 1 - L2²/2 with norm-2.5 vectors | −1.44 → clamped 0 |
| empty recall rate (before) | 59% (145/242) |
| empty recall rate (after normalize+reindex) | 2% |

Verified that intfloat/multilingual-e5-large is also un-normalized via FastEmbed
(norm ≈ 29), so this affects multiple multilingual models, not just mpnet.
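
The measured numbers can be reproduced without FastEmbed. The sketch below builds two synthetic 2-D vectors with norm 2.5 and cosine 0.61 (stand-ins for real embeddings, chosen only to match the figures above) and applies the unit-vector formula before and after normalization. All function names are illustrative and do not exist in the codebase.

```python
import math


def l2_norm(vec):
    """Euclidean norm of a vector."""
    return math.sqrt(sum(x * x for x in vec))


def cosine(a, b):
    """True cosine similarity, valid for vectors of any norm."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (l2_norm(a) * l2_norm(b))


def unit_formula_similarity(a, b):
    """1 - L2^2 / 2, which equals cosine only when both inputs are unit vectors."""
    l2_squared = sum((x - y) ** 2 for x, y in zip(a, b))
    return 1.0 - l2_squared / 2.0


def normalize(vec):
    """Scale a non-zero vector to unit length."""
    n = l2_norm(vec)
    return [x / n for x in vec]


# Synthetic pair: norm 2.5 each, angle chosen so that cosine = 0.61.
theta = math.acos(0.61)
a = [2.5, 0.0]
b = [2.5 * math.cos(theta), 2.5 * math.sin(theta)]

raw = unit_formula_similarity(a, b)                          # about -1.44
fixed = unit_formula_similarity(normalize(a), normalize(b))  # about 0.61
print(f"cosine={cosine(a, b):.2f} raw={raw:.2f} fixed={fixed:.2f}")
```

The raw value is negative and so clamps to 0, below the default semantic_min_similarity=0.55. The normalized value matches the honest cosine and passes the threshold.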

Suggested fix

L2-normalize in _embed_batch (handles every model, idempotent for already-normalized
bge):

values = [float(v) for v in (vector.tolist() if hasattr(vector, "tolist") else vector)]
norm = sum(v * v for v in values) ** 0.5
if norm > 0.0:
    values = [v / norm for v in values]
normalized.append(values)
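
As a standalone illustration of the same idea, the normalization can be pulled into a small helper. This is a sketch, not the project's code: the name `l2_normalize` is hypothetical, and it assumes the input is any iterable of numbers (a list or a NumPy array both work when iterated).

```python
import math


def l2_normalize(vector):
    """Return the vector scaled to unit L2 norm as a list of floats.

    A zero vector is returned unchanged to avoid division by zero.
    """
    values = [float(v) for v in vector]
    norm = math.sqrt(sum(v * v for v in values))
    if norm == 0.0:
        return values
    return [v / norm for v in values]


once = l2_normalize([3, 4])   # norm 5, so roughly [0.6, 0.8]
twice = l2_normalize(once)    # already unit length, so effectively unchanged
```

Applying the helper to an already-normalized vector leaves it unchanged up to floating-point rounding, which is why the fix is safe for bge models too. Existing stored vectors would still need a reindex.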

Alternatively, declare the vec0 table with distance_metric=cosine and branch the
similarity conversion accordingly.
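
A branched conversion might look roughly like the sketch below. It rests on two assumptions that should be checked against sqlite-vec: that the cosine metric reports 1 − cos_sim, and that the L2 metric reports the plain (unsquared) Euclidean distance. The function name and the `metric` argument are hypothetical. The clamp to [0, 1] mirrors the clamping described in the summary.

```python
import math


def distance_to_similarity(distance, metric):
    """Convert a vector-search distance into a similarity in [0, 1].

    metric == "cosine": assumes distance = 1 - cos_sim (any vector norm).
    metric == "l2":     assumes Euclidean distance between UNIT vectors.
    """
    if metric == "cosine":
        similarity = 1.0 - distance
    elif metric == "l2":
        similarity = 1.0 - (distance * distance) / 2.0
    else:
        raise ValueError(f"unsupported metric: {metric!r}")
    # Clamp so that downstream thresholds always see a value in [0, 1].
    return max(0.0, min(1.0, similarity))


# cos_sim = 0.61 expressed under each metric:
sim_cosine = distance_to_similarity(0.39, "cosine")              # about 0.61
sim_l2_unit = distance_to_similarity(math.sqrt(0.78), "l2")      # about 0.61 (unit vectors)
sim_l2_raw = distance_to_similarity(math.sqrt(4.875), "l2")      # 0.0 (norm-2.5 vectors)
```

The cosine branch does not depend on vector norm, so it avoids the failure mode. The L2 branch is still only correct when the stored vectors are unit length.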

Environment
