Summary
FastEmbedEmbeddingProvider does not L2-normalize embedding vectors. The SQLite
distance→similarity formula introduced in #593 (cos_sim = 1 - L2²/2) is only valid
for unit vectors. FastEmbed normalizes the output of the default model,
bge-small-en-v1.5, so the problem never surfaced. Models that FastEmbed does not
normalize, notably sentence-transformers/paraphrase-multilingual-mpnet-base-v2 (the
natural multilingual choice), produce vectors with norm ≈ 2.5. For those vectors the
formula yields a negative similarity that is clamped to ~0, and the default
semantic_min_similarity=0.55 then filters out the entire semantic channel. Hybrid
search silently becomes FTS-only.
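For reference, the identity behind the formula follows from expanding the squared Euclidean distance. The symbols a, b (two embedding vectors), d (their L2 distance), θ (the angle between them), and r (a common norm) are introduced here for illustration only:

```latex
\begin{aligned}
d^2 = \lVert a - b \rVert^2 &= \lVert a \rVert^2 + \lVert b \rVert^2 - 2\, a \cdot b \\
\text{if } \lVert a \rVert = \lVert b \rVert = 1:\quad
  d^2 &= 2 - 2\cos\theta
  \;\Longrightarrow\; \cos\theta = 1 - \frac{d^2}{2} \\
\text{if } \lVert a \rVert = \lVert b \rVert = r:\quad
  1 - \frac{d^2}{2} &= 1 - r^2\,(1 - \cos\theta)
\end{aligned}
```

With r = 2.5 and cos θ = 0.61 the last line gives 1 - 6.25 × 0.39 ≈ -1.44, so even a clearly relevant pair lands below zero before clamping.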
Impact
On a real vault (multilingual-mpnet, 499 notes, v0.21.5), 59% of 242 real recall
queries mined from agent transcripts returned zero results. Conceptual and
inflected (Russian) queries failed because FTS has no lemmatization and the semantic
fallback was dead. After fixing normalization and reindexing, the empty rate dropped
from 59% to 2% (145 of 149 previously empty queries recovered; the 4 remaining
target content that is genuinely absent).
Root cause
basic_memory/repository/fastembed_provider.py::embed_documents._embed_batch:
    vectors = list(model.embed(texts, **embed_kwargs))
    normalized: list[list[float]] = []  # named "normalized"...
    for vector in vectors:
        values = vector.tolist() if hasattr(vector, "tolist") else vector
        normalized.append([float(value) for value in values])  # ...but never normalized
    return normalized
The vector is never L2-normalized. Combined with the SQLite 1 - L2²/2 similarity
conversion (valid only for unit vectors), this breaks semantic search for any model
whose output is not already normalized.
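The failure can be reproduced without any model. The sketch below uses two synthetic 2-D vectors of norm 2.5 whose true cosine is 0.61; it is not the project's code. The helper names and the clamp to [0, 1] are assumptions made for illustration, based on the behaviour described in this issue.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine(a, b):
    """True cosine similarity, independent of vector length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def l2_formula_similarity(a, b):
    """The 1 - L2^2/2 conversion, which equals cosine only for unit vectors."""
    return 1.0 - l2_distance(a, b) ** 2 / 2.0


def clamp01(value):
    """Assumed clamp of the reported similarity to [0, 1]."""
    return max(0.0, min(1.0, value))


def unit(v):
    """Scale a non-zero vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


# Synthetic pair: norm 2.5 each, angle chosen so the true cosine is 0.61.
theta = math.acos(0.61)
a = [2.5, 0.0]
b = [2.5 * math.cos(theta), 2.5 * math.sin(theta)]

raw = l2_formula_similarity(a, b)                   # about -1.44
reported = clamp01(raw)                             # 0.0, below any positive threshold
fixed = l2_formula_similarity(unit(a), unit(b))     # about 0.61 once normalized
```

The same pair that scores 0 with raw vectors scores about 0.61 after normalization, which clears the 0.55 default threshold.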
Evidence (measured)
| metric | value |
|---|---|
| stored vector norm (mpnet) | ~2.5 (should be 1.0) |
| honest cosine, relevant chunk pair | 0.61 |
| similarity BM reports | 0.00 |
| 1 - L2²/2 with norm-2.5 vectors | −1.44 → clamped 0 |
| empty recall rate (before) | 59% (145/242) |
| empty recall rate (after normalize+reindex) | 2% |
Verified that intfloat/multilingual-e5-large is also un-normalized via FastEmbed
(norm ≈ 29), so this affects multiple multilingual models, not just mpnet.
Suggested fix
L2-normalize in _embed_batch (handles every model, idempotent for already-normalized
bge):
    values = [float(v) for v in (vector.tolist() if hasattr(vector, "tolist") else vector)]
    norm = sum(v * v for v in values) ** 0.5
    if norm > 0.0:
        values = [v / norm for v in values]
    normalized.append(values)
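The same logic can be pulled into a standalone helper so it is testable in isolation. This is a sketch only: the function name l2_normalize is hypothetical and does not exist in the codebase, and the zero-vector behaviour (returned unchanged) simply mirrors the guard in the snippet above.

```python
import math
from typing import Iterable


def l2_normalize(vector: Iterable[float]) -> list[float]:
    """Return the vector scaled to unit L2 norm as a list of floats.

    Accepts plain sequences or array-likes exposing .tolist() (e.g. numpy arrays).
    A zero vector is returned unchanged to avoid dividing by zero.
    """
    raw = vector.tolist() if hasattr(vector, "tolist") else vector
    values = [float(v) for v in raw]
    norm = math.sqrt(sum(v * v for v in values))
    if norm > 0.0:
        values = [v / norm for v in values]
    return values


def l2_norm(values: Iterable[float]) -> float:
    """Euclidean length, used here only to check the result."""
    return math.sqrt(sum(v * v for v in values))


once = l2_normalize([3.0, 4.0])   # length-5 input scaled to length 1
twice = l2_normalize(once)        # already unit length, so effectively unchanged
```

Applying it twice leaves the vector unchanged up to floating-point rounding, which is why models that already emit unit vectors are unaffected.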
Alternatively, declare the vec0 table with distance_metric=cosine and branch the
similarity conversion accordingly.
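A rough sketch of what such a branch could look like follows. It assumes that the cosine metric reports distance as 1 minus cosine similarity and that the result is clamped to [0, 1]; both should be checked against the sqlite-vec documentation and the existing conversion code. The function name and the metric strings are hypothetical.

```python
def distance_to_similarity(distance: float, metric: str) -> float:
    """Convert a vector-search distance into a similarity in [0, 1].

    metric == "cosine": assumes distance = 1 - cos_sim (valid for any vector norm).
    metric == "l2":     uses 1 - d^2/2, which is correct only for unit vectors.
    """
    if metric == "cosine":
        raw = 1.0 - distance
    elif metric == "l2":
        raw = 1.0 - (distance * distance) / 2.0
    else:
        raise ValueError(f"unsupported distance metric: {metric!r}")
    # Assumed clamp, matching the "clamps to ~0" behaviour described above.
    return max(0.0, min(1.0, raw))
```

Under these assumptions, a pair with true cosine 0.61 maps to 0.61 from a cosine distance of 0.39 regardless of vector norm, whereas the l2 branch gives the same answer only if the stored vectors are unit length.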
Environment
sentence-transformers/paraphrase-multilingual-mpnet-base-v2 (768d), default_search_type=hybrid