feat(backend): NER-based speaker detection using spaCy by sungdark · Pull Request #5918 · BasedHardware/omi

sungdark · 2026-03-23T00:39:31Z

Summary

Implements NER-based speaker name detection as described in issue #3039. Uses spaCy's pre-trained NER model as a fallback when regex patterns don't match speaker self-introductions.

Changes

New: backend/utils/ner_speaker_detection.py

detect_persons_with_ner(): Extract PERSON entities using spaCy NER
_is_valid_person_name(): Filter false positives (companies, products)
_normalize_name(): Title-case normalization for names with hyphens/apostrophes
Cached model loading to avoid re-initialization overhead
Supports self-hosting (no external API calls required)

Modified: backend/utils/speaker_identification.py

detect_speaker_from_text() now uses 2-pass approach:
- Pass 1: Regex patterns (fast, high-precision for self-introductions like "I am X", "My name is Y")
- Pass 2: NER fallback (catches names in third-person contexts like "John said..." or "Sarah was here")
Added detect_all_speakers_from_text() for multi-speaker conversations

backend/requirements.txt

Added spacy>=3.7.0

New: backend/tests/unit/test_ner_speaker_detection.py

Unit tests for NER detection, name validation, normalization

Why NER?

The current regex-based approach misses many speaker names that don't follow self-introduction patterns. NER can detect:

Names mentioned in third person ("John was talking about...")
Names in narrative context ("Sarah joined later")
Multiple speakers in the same segment

Installation

After merge, users running the backend should install:
pip install spacy
python -m spacy download en_core_web_sm

The code gracefully degrades if spaCy is not installed (NER pass is skipped).

Bounty Reference

Closes #3039 (NER for speaker detection - $500 bounty)

Implements NER-based PERSON entity extraction as a fallback for regex-based speaker identification (issue BasedHardware#3039). Changes: - Add backend/utils/ner_speaker_detection.py with spaCy NER integration - Modify detect_speaker_from_text() to use 2-pass detection: 1. Regex patterns (fast, high-precision for self-introductions) 2. NER fallback (catches names in third-person contexts) - Add detect_all_speakers_from_text() for multi-speaker detection - Add spaCy to backend/requirements.txt - Add unit tests for NER detection module The NER approach uses spaCy's pre-trained en_core_web_sm model which can be self-hosted with no external API calls, supporting the project's requirement for fast, low-cost, open-license NER. Installation: pip install spacy && python -m spacy download en_core_web_sm

greptile-apps · 2026-03-23T00:50:28Z

Greptile Summary

This PR adds NER-based speaker detection to the backend using spaCy as a fallback when regex patterns don't match self-introduction phrases. The overall approach is sound — a two-pass strategy (fast regex → NER fallback) fits well within the existing speaker_identification.py architecture and the graceful degradation intent is correct. However, there is a critical bug that silently breaks the entire feature, plus a few architectural and logic issues.

Key issues:

Feature is broken as written: spacy.tokensSpan at ner_speaker_detection.py:120 is an invalid attribute. Python evaluates this annotation at import time and raises AttributeError, causing the import to fail. The try/except in _is_ner_available() catches this and reports NER as unavailable — so NER never activates even with spaCy fully installed.
Over-broad validation short-circuit: The "is " substring in _is_valid_person_name's context pattern match bypasses all other filters for any entity in a sentence that simply contains the word "is", leading to excessive false positives.
In-function imports: Three from utils.ner_speaker_detection import ... calls inside function bodies violate the project's backend import rules.
Unpinned dependency: spacy>=3.7.0 in requirements.txt is inconsistent with the pinned versions used by every other dependency.
Title-stripping logic mismatch: The context_before guard and the entity-text manipulation target different text spans, making the title-stripping block unlikely to fire as intended.

Confidence Score: 2/5

Not safe to merge as-is — the feature is silently broken due to a type annotation bug, and the false-positive logic could misidentify speakers in production transcripts once the bug is fixed.
The critical spacy.tokensSpan annotation bug means NER detection never activates even with spaCy installed, rendering the primary contribution of this PR inoperative. Fixing that bug would then expose the over-broad "is " context pattern, which would cause NER to accept many false-positive names. Both issues need to be resolved before this feature can work correctly in production.
backend/utils/ner_speaker_detection.py requires the most attention: fix the spacy.tokensSpan type annotation on line 120 and tighten the context pattern on line 108.

Important Files Changed

Filename	Overview
backend/utils/ner_speaker_detection.py	New NER module using spaCy. Contains a critical bug: `spacy.tokensSpan` at line 120 is an invalid attribute that causes `AttributeError` at import time, silently disabling the entire feature. Also has an overly broad `"is "` context pattern causing false positives, and a flawed title-stripping block that checks `context_before` while operating on entity text.
backend/utils/speaker_identification.py	Extended with a two-pass detection approach (regex + NER fallback) and a new `detect_all_speakers_from_text` function. Uses in-function imports that violate project rules; NER imports should be module-level or at least moved to a single top-level try/except block.
backend/requirements.txt	Adds `spacy>=3.7.0` with a range specifier, inconsistent with all other pinned dependencies in the file. Should use an exact version.
backend/tests/unit/test_ner_speaker_detection.py	New unit tests for the NER module. Tests are well-structured with good coverage of edge cases. Several tests use loose assertions (e.g., `assert isinstance(result, list)`) that don't actually verify correctness, but overall coverage is adequate for the initial implementation.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[detect_speaker_from_text / detect_all_speakers_from_text] --> B{Pass 1: Regex patterns}
    B -- match found --> C[Return name.capitalize]
    B -- no match --> D{NER available?}
    D -- No --> E[Return None / empty list]
    D -- Yes --> F[detect_persons_with_ner]
    F --> G[_load_model - cached spaCy model]
    G -- load error --> E
    G -- success --> H[nlp text - run NER pipeline]
    H --> I{For each PERSON entity}
    I --> J[_get_context - window of tokens]
    J --> K{_is_valid_person_name?}
    K -- invalid --> I
    K -- valid --> L[_normalize_name - title case hyphens apostrophes]
    L --> M{Seen before? Deduplicate}
    M -- duplicate --> I
    M -- new --> N[Add to found_names]
    N --> O{max_persons reached?}
    O -- no --> I
    O -- yes --> P[Return found_names]

_{Reviews (1): Last reviewed commit: "feat(backend): add NER-based speaker nam..." | Re-trigger Greptile}

greptile-apps · 2026-03-23T00:50:31Z

+    return True
+
+
+def _get_context(doc: Doc, ent: spacy.tokensSpan, window: int = 10) -> tuple:


Invalid type annotation causes AttributeError at import time

spacy.tokensSpan is not a valid attribute — the correct path is spacy.tokens.Span. Because Python evaluates function annotations at def-execution time (absent from __future__ import annotations), this triggers an AttributeError: module 'spacy' has no attribute 'tokensSpan' the moment ner_speaker_detection is imported. The try/except in _is_ner_available() catches the import failure, so the app won't crash, but NER will silently never work even when spaCy and the model are both properly installed.

Since Doc is already imported from spacy.tokens, the simplest fix is to import Span the same way:

Suggested change

def _get_context(doc: Doc, ent: spacy.tokensSpan, window: int = 10) -> tuple:

def _get_context(doc: Doc, ent: "spacy.tokens.Span", window: int = 10) -> tuple:

Or add from spacy.tokens import Span at the top of the file and use Span directly:

from spacy.tokens import Doc, Span def _get_context(doc: Doc, ent: Span, window: int = 10) -> tuple:

greptile-apps · 2026-03-23T00:50:32Z

+            from utils.ner_speaker_detection import is_ner_available
+            _ner_detection_available = is_ner_available()


In-function imports violate project import rules

The project's backend import rules prohibit in-function imports. There are three occurrences in this file:

speaker_identification.py:241 — from utils.ner_speaker_detection import is_ner_available (inside _is_ner_available)

speaker_identification.py:277 — from utils.ner_speaker_detection import detect_persons_with_ner (inside detect_speaker_from_text)

speaker_identification.py:321 — from utils.ner_speaker_detection import detect_persons_with_ner (inside detect_all_speakers_from_text)

The intent is to avoid a hard dependency on spaCy, but since spacy>=3.7.0 is now added to requirements.txt, it is already a hard dependency. The lazy-import workaround is therefore inconsistent with the declared requirements.

Move the import to the top of the file and remove the _is_ner_available caching wrapper, which becomes unnecessary:

try: from utils.ner_speaker_detection import detect_persons_with_ner, is_ner_available as _ner_is_available _ner_detection_available = _ner_is_available() except Exception: _ner_detection_available = False detect_persons_with_ner = None # type: ignore

Context Used: Backend Python import rules - no in-function impor... (source)

greptile-apps · 2026-03-23T00:50:33Z

 SoundCard==0.4.3
 soundfile==0.12.1
 soxr==0.4.0
+spacy>=3.7.0


Unpinned version breaks reproducible builds

Every other entry in this file uses an exact pinned version (e.g., SoundCard==0.4.3, speechbrain==1.0.0). Using spacy>=3.7.0 allows any future major/minor release to be pulled in, which could introduce breaking changes silently.

Suggested change

spacy>=3.7.0

spacy==3.7.4

Pick the exact version that has been validated with the rest of the dependency tree.

greptile-apps · 2026-03-23T00:50:34Z

+    # Check for common name introduction patterns - if "name" appears after
+    # words like "called" or "named", it's more likely to be a real name
+    context_lower = context_before.lower()
+    if any(p in context_lower for p in ["called ", "named ", "is ", "this is ", "i'm ", "i am "]):
+        return True


Over-broad context pattern causes false positives

The substring "is " used as a short-circuit return True is far too wide. Almost any English sentence contains "is" (e.g., "The server is down", "This is unrelated", "There is no issue"). Any PERSON entity that spaCy detects in a sentence containing "is" will bypass all remaining validation and be accepted unconditionally.

Consider narrowing the patterns to the specific introduction phrases you actually intend to detect, and avoiding the bare "is " substring:

Suggested change

# Check for common name introduction patterns - if "name" appears after

# words like "called" or "named", it's more likely to be a real name

context_lower = context_before.lower()

if any(p in context_lower for p in ["called ", "named ", "is ", "this is ", "i'm ", "i am "]):

return True

if any(p in context_lower for p in ["called ", "named ", "this is ", "i'm ", "i am "]):

return True

greptile-apps · 2026-03-23T00:50:35Z

+        # Additional context-based filtering
+        # Skip names that are clearly describing someone not present
+        # (e.g., "President Biden" -> just "Biden")
+        if "president" in context_before.lower() or "minister" in context_before.lower():
+            # Extract just the name part
+            words = normalized.split()
+            if len(words) > 1 and words[0].lower() in ["president", "minister", "dr", "dr.", "mr", "mr.", "ms", "ms.", "mrs", "mrs."]:
+                normalized = " ".join(words[1:])


Title-stripping logic checks wrong text

The condition checks context_before for "president" / "minister", but the actual title stripping operates on words[0] of normalized (the entity text itself). These are two different spans.

In practice, spaCy's NER model usually does NOT include honorifics/titles as part of a PERSON entity span — it would extract "Biden" rather than "President Biden". So words[0] of a typical PERSON entity will rarely be "president". The guard condition on context_before would fire in sentences like "Former president Biden said...", but normalized would just be "Biden" (a single word), causing the len(words) > 1 guard to suppress the stripping silently.

The condition that actually makes this work as intended would be to inspect the entity text itself:

words = normalized.split() if len(words) > 1 and words[0].lower() in {"president", "minister", "dr", "dr.", "mr", "mr.", "ms", "ms.", "mrs", "mrs."}: normalized = " ".join(words[1:])

This would be unconditional (no need to check context_before), matching cases where spaCy does include the title in the entity span.

…itive patterns Critical fixes: - Fix AttributeError: 'spacy.tokensSpan' is invalid; use 'Span' from spacy.tokens - Tighten introduction pattern matching with word boundaries to reduce false positives (remove overly broad 'is ' substring match) - Pin spacy and en-core-web-sm versions for reproducibility - Move in-function NER imports to module level (per project import rules) Fixes Greptile review issues: - BasedHardware#5918 (comment)

beastoin · 2026-03-28T08:32:39Z

AI PRs with low efforts are not welcome here. Thank you. — by CTO

github-actions · 2026-03-28T08:32:48Z

Hey @sungdark 👋

Thank you so much for taking the time to contribute to Omi! We truly appreciate you putting in the effort to submit this pull request.

After careful review, we've decided not to merge this particular PR. Please don't take this personally — we genuinely try to merge as many contributions as possible, but sometimes we have to make tough calls based on:

Project standards — Ensuring consistency across the codebase
User needs — Making sure changes align with what our users need
Code best practices — Maintaining code quality and maintainability
Project direction — Keeping aligned with our roadmap and vision

Your contribution is still valuable to us, and we'd love to see you contribute again in the future! If you'd like feedback on how to improve this PR or want to discuss alternative approaches, please don't hesitate to reach out.

Thank you for being part of the Omi community! 💜

sungdark mentioned this pull request Mar 23, 2026

Use NER (Named Entity Recognition) or better techniques (like self-hosted LLM) to improve speaker detection based on transcripts ($500) #3039

Open

greptile-apps bot reviewed Mar 23, 2026

View reviewed changes

beastoin closed this Mar 28, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(backend): NER-based speaker detection using spaCy#5918

feat(backend): NER-based speaker detection using spaCy#5918
sungdark wants to merge 2 commits intoBasedHardware:mainfrom
sungdark:feat/ner-speaker-detection

sungdark commented Mar 23, 2026

Uh oh!

greptile-apps bot commented Mar 23, 2026

Uh oh!

greptile-apps bot Mar 23, 2026

Uh oh!

greptile-apps bot Mar 23, 2026

Uh oh!

greptile-apps bot Mar 23, 2026

Uh oh!

greptile-apps bot Mar 23, 2026

Uh oh!

greptile-apps bot Mar 23, 2026

Uh oh!

beastoin commented Mar 28, 2026

Uh oh!

github-actions bot commented Mar 28, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

		return True


		def _get_context(doc: Doc, ent: spacy.tokensSpan, window: int = 10) -> tuple:

	def _get_context(doc: Doc, ent: spacy.tokensSpan, window: int = 10) -> tuple:
	def _get_context(doc: Doc, ent: "spacy.tokens.Span", window: int = 10) -> tuple:

		from utils.ner_speaker_detection import is_ner_available
		_ner_detection_available = is_ner_available()

Conversation

sungdark commented Mar 23, 2026

Summary

Changes

New: backend/utils/ner_speaker_detection.py

Modified: backend/utils/speaker_identification.py

backend/requirements.txt

New: backend/tests/unit/test_ner_speaker_detection.py

Why NER?

Installation

Bounty Reference

Uh oh!

greptile-apps bot commented Mar 23, 2026

Greptile Summary

Confidence Score: 2/5

Important Files Changed

Flowchart

Uh oh!

greptile-apps bot Mar 23, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps bot Mar 23, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps bot Mar 23, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps bot Mar 23, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps bot Mar 23, 2026

Choose a reason for hiding this comment

Uh oh!

beastoin commented Mar 28, 2026

Uh oh!

github-actions bot commented Mar 28, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants