Skip to content

feat(backend): NER-based speaker detection using spaCy#5918

Closed
sungdark wants to merge 2 commits intoBasedHardware:mainfrom
sungdark:feat/ner-speaker-detection
Closed

feat(backend): NER-based speaker detection using spaCy#5918
sungdark wants to merge 2 commits intoBasedHardware:mainfrom
sungdark:feat/ner-speaker-detection

Conversation

@sungdark
Copy link
Copy Markdown

Summary

Implements NER-based speaker name detection as described in issue #3039. Uses spaCy's pre-trained NER model as a fallback when regex patterns don't match speaker self-introductions.

Changes

New: backend/utils/ner_speaker_detection.py

  • detect_persons_with_ner(): Extract PERSON entities using spaCy NER
  • _is_valid_person_name(): Filter false positives (companies, products)
  • _normalize_name(): Title-case normalization for names with hyphens/apostrophes
  • Cached model loading to avoid re-initialization overhead
  • Supports self-hosting (no external API calls required)

Modified: backend/utils/speaker_identification.py

  • detect_speaker_from_text() now uses 2-pass approach:
    • Pass 1: Regex patterns (fast, high-precision for self-introductions like "I am X", "My name is Y")
    • Pass 2: NER fallback (catches names in third-person contexts like "John said..." or "Sarah was here")
  • Added detect_all_speakers_from_text() for multi-speaker conversations

backend/requirements.txt

  • Added spacy>=3.7.0

New: backend/tests/unit/test_ner_speaker_detection.py

  • Unit tests for NER detection, name validation, normalization

Why NER?

The current regex-based approach misses many speaker names that don't follow self-introduction patterns. NER can detect:

  • Names mentioned in third person ("John was talking about...")
  • Names in narrative context ("Sarah joined later")
  • Multiple speakers in the same segment

Installation

After merge, users running the backend should install:
pip install spacy
python -m spacy download en_core_web_sm

The code gracefully degrades if spaCy is not installed (NER pass is skipped).

Bounty Reference

Closes #3039 (NER for speaker detection - $500 bounty)

Implements NER-based PERSON entity extraction as a fallback for
regex-based speaker identification (issue BasedHardware#3039).

Changes:
- Add backend/utils/ner_speaker_detection.py with spaCy NER integration
- Modify detect_speaker_from_text() to use 2-pass detection:
  1. Regex patterns (fast, high-precision for self-introductions)
  2. NER fallback (catches names in third-person contexts)
- Add detect_all_speakers_from_text() for multi-speaker detection
- Add spaCy to backend/requirements.txt
- Add unit tests for NER detection module

The NER approach uses spaCy's pre-trained en_core_web_sm model
which can be self-hosted with no external API calls, supporting
the project's requirement for fast, low-cost, open-license NER.

Installation: pip install spacy && python -m spacy download en_core_web_sm
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps bot commented Mar 23, 2026

Greptile Summary

This PR adds NER-based speaker detection to the backend using spaCy as a fallback when regex patterns don't match self-introduction phrases. The overall approach is sound — a two-pass strategy (fast regex → NER fallback) fits well within the existing speaker_identification.py architecture and the graceful degradation intent is correct. However, there is a critical bug that silently breaks the entire feature, plus a few architectural and logic issues.

Key issues:

  • Feature is broken as written: spacy.tokensSpan at ner_speaker_detection.py:120 is an invalid attribute. Python evaluates this annotation at import time and raises AttributeError, causing the import to fail. The try/except in _is_ner_available() catches this and reports NER as unavailable — so NER never activates even with spaCy fully installed.
  • Over-broad validation short-circuit: The "is " substring in _is_valid_person_name's context pattern match bypasses all other filters for any entity in a sentence that simply contains the word "is", leading to excessive false positives.
  • In-function imports: Three from utils.ner_speaker_detection import ... calls inside function bodies violate the project's backend import rules.
  • Unpinned dependency: spacy>=3.7.0 in requirements.txt is inconsistent with the pinned versions used by every other dependency.
  • Title-stripping logic mismatch: The context_before guard and the entity-text manipulation target different text spans, making the title-stripping block unlikely to fire as intended.

Confidence Score: 2/5

  • Not safe to merge as-is — the feature is silently broken due to a type annotation bug, and the false-positive logic could misidentify speakers in production transcripts once the bug is fixed.
  • The critical spacy.tokensSpan annotation bug means NER detection never activates even with spaCy installed, rendering the primary contribution of this PR inoperative. Fixing that bug would then expose the over-broad "is " context pattern, which would cause NER to accept many false-positive names. Both issues need to be resolved before this feature can work correctly in production.
  • backend/utils/ner_speaker_detection.py requires the most attention: fix the spacy.tokensSpan type annotation on line 120 and tighten the context pattern on line 108.

Important Files Changed

Filename Overview
backend/utils/ner_speaker_detection.py New NER module using spaCy. Contains a critical bug: spacy.tokensSpan at line 120 is an invalid attribute that causes AttributeError at import time, silently disabling the entire feature. Also has an overly broad "is " context pattern causing false positives, and a flawed title-stripping block that checks context_before while operating on entity text.
backend/utils/speaker_identification.py Extended with a two-pass detection approach (regex + NER fallback) and a new detect_all_speakers_from_text function. Uses in-function imports that violate project rules; NER imports should be module-level or at least moved to a single top-level try/except block.
backend/requirements.txt Adds spacy>=3.7.0 with a range specifier, inconsistent with all other pinned dependencies in the file. Should use an exact version.
backend/tests/unit/test_ner_speaker_detection.py New unit tests for the NER module. Tests are well-structured with good coverage of edge cases. Several tests use loose assertions (e.g., assert isinstance(result, list)) that don't actually verify correctness, but overall coverage is adequate for the initial implementation.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[detect_speaker_from_text / detect_all_speakers_from_text] --> B{Pass 1: Regex patterns}
    B -- match found --> C[Return name.capitalize]
    B -- no match --> D{NER available?}
    D -- No --> E[Return None / empty list]
    D -- Yes --> F[detect_persons_with_ner]
    F --> G[_load_model - cached spaCy model]
    G -- load error --> E
    G -- success --> H[nlp text - run NER pipeline]
    H --> I{For each PERSON entity}
    I --> J[_get_context - window of tokens]
    J --> K{_is_valid_person_name?}
    K -- invalid --> I
    K -- valid --> L[_normalize_name - title case hyphens apostrophes]
    L --> M{Seen before? Deduplicate}
    M -- duplicate --> I
    M -- new --> N[Add to found_names]
    N --> O{max_persons reached?}
    O -- no --> I
    O -- yes --> P[Return found_names]
Loading

Reviews (1): Last reviewed commit: "feat(backend): add NER-based speaker nam..." | Re-trigger Greptile

Comment thread backend/utils/ner_speaker_detection.py Outdated
return True


def _get_context(doc: Doc, ent: spacy.tokensSpan, window: int = 10) -> tuple:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P0 Invalid type annotation causes AttributeError at import time

spacy.tokensSpan is not a valid attribute — the correct path is spacy.tokens.Span. Because Python evaluates function annotations at def-execution time (absent from __future__ import annotations), this triggers an AttributeError: module 'spacy' has no attribute 'tokensSpan' the moment ner_speaker_detection is imported. The try/except in _is_ner_available() catches the import failure, so the app won't crash, but NER will silently never work even when spaCy and the model are both properly installed.

Since Doc is already imported from spacy.tokens, the simplest fix is to import Span the same way:

Suggested change
def _get_context(doc: Doc, ent: spacy.tokensSpan, window: int = 10) -> tuple:
def _get_context(doc: Doc, ent: "spacy.tokens.Span", window: int = 10) -> tuple:

Or add from spacy.tokens import Span at the top of the file and use Span directly:

from spacy.tokens import Doc, Span

def _get_context(doc: Doc, ent: Span, window: int = 10) -> tuple:

Comment thread backend/utils/speaker_identification.py Outdated
Comment on lines +241 to +242
from utils.ner_speaker_detection import is_ner_available
_ner_detection_available = is_ner_available()
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 In-function imports violate project import rules

The project's backend import rules prohibit in-function imports. There are three occurrences in this file:

  • speaker_identification.py:241from utils.ner_speaker_detection import is_ner_available (inside _is_ner_available)
  • speaker_identification.py:277from utils.ner_speaker_detection import detect_persons_with_ner (inside detect_speaker_from_text)
  • speaker_identification.py:321from utils.ner_speaker_detection import detect_persons_with_ner (inside detect_all_speakers_from_text)

The intent is to avoid a hard dependency on spaCy, but since spacy>=3.7.0 is now added to requirements.txt, it is already a hard dependency. The lazy-import workaround is therefore inconsistent with the declared requirements.

Move the import to the top of the file and remove the _is_ner_available caching wrapper, which becomes unnecessary:

try:
    from utils.ner_speaker_detection import detect_persons_with_ner, is_ner_available as _ner_is_available
    _ner_detection_available = _ner_is_available()
except Exception:
    _ner_detection_available = False
    detect_persons_with_ner = None  # type: ignore

Context Used: Backend Python import rules - no in-function impor... (source)

Comment thread backend/requirements.txt Outdated
SoundCard==0.4.3
soundfile==0.12.1
soxr==0.4.0
spacy>=3.7.0
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Unpinned version breaks reproducible builds

Every other entry in this file uses an exact pinned version (e.g., SoundCard==0.4.3, speechbrain==1.0.0). Using spacy>=3.7.0 allows any future major/minor release to be pulled in, which could introduce breaking changes silently.

Suggested change
spacy>=3.7.0
spacy==3.7.4

Pick the exact version that has been validated with the rest of the dependency tree.

Comment on lines +105 to +109
# Check for common name introduction patterns - if "name" appears after
# words like "called" or "named", it's more likely to be a real name
context_lower = context_before.lower()
if any(p in context_lower for p in ["called ", "named ", "is ", "this is ", "i'm ", "i am "]):
return True
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Over-broad context pattern causes false positives

The substring "is " used as a short-circuit return True is far too wide. Almost any English sentence contains "is" (e.g., "The server is down", "This is unrelated", "There is no issue"). Any PERSON entity that spaCy detects in a sentence containing "is" will bypass all remaining validation and be accepted unconditionally.

Consider narrowing the patterns to the specific introduction phrases you actually intend to detect, and avoiding the bare "is " substring:

Suggested change
# Check for common name introduction patterns - if "name" appears after
# words like "called" or "named", it's more likely to be a real name
context_lower = context_before.lower()
if any(p in context_lower for p in ["called ", "named ", "is ", "this is ", "i'm ", "i am "]):
return True
if any(p in context_lower for p in ["called ", "named ", "this is ", "i'm ", "i am "]):
return True

Comment on lines +260 to +267
# Additional context-based filtering
# Skip names that are clearly describing someone not present
# (e.g., "President Biden" -> just "Biden")
if "president" in context_before.lower() or "minister" in context_before.lower():
# Extract just the name part
words = normalized.split()
if len(words) > 1 and words[0].lower() in ["president", "minister", "dr", "dr.", "mr", "mr.", "ms", "ms.", "mrs", "mrs."]:
normalized = " ".join(words[1:])
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Title-stripping logic checks wrong text

The condition checks context_before for "president" / "minister", but the actual title stripping operates on words[0] of normalized (the entity text itself). These are two different spans.

In practice, spaCy's NER model usually does NOT include honorifics/titles as part of a PERSON entity span — it would extract "Biden" rather than "President Biden". So words[0] of a typical PERSON entity will rarely be "president". The guard condition on context_before would fire in sentences like "Former president Biden said...", but normalized would just be "Biden" (a single word), causing the len(words) > 1 guard to suppress the stripping silently.

The condition that actually makes this work as intended would be to inspect the entity text itself:

words = normalized.split()
if len(words) > 1 and words[0].lower() in {"president", "minister", "dr", "dr.", "mr", "mr.", "ms", "ms.", "mrs", "mrs."}:
    normalized = " ".join(words[1:])

This would be unconditional (no need to check context_before), matching cases where spaCy does include the title in the entity span.

…itive patterns

Critical fixes:
- Fix AttributeError: 'spacy.tokensSpan' is invalid; use 'Span' from spacy.tokens
- Tighten introduction pattern matching with word boundaries to reduce
  false positives (remove overly broad 'is ' substring match)
- Pin spacy and en-core-web-sm versions for reproducibility
- Move in-function NER imports to module level (per project import rules)

Fixes Greptile review issues:
- BasedHardware#5918 (comment)
@beastoin
Copy link
Copy Markdown
Collaborator

AI PRs with low efforts are not welcome here. Thank you. — by CTO

@beastoin beastoin closed this Mar 28, 2026
@github-actions
Copy link
Copy Markdown
Contributor

Hey @sungdark 👋

Thank you so much for taking the time to contribute to Omi! We truly appreciate you putting in the effort to submit this pull request.

After careful review, we've decided not to merge this particular PR. Please don't take this personally — we genuinely try to merge as many contributions as possible, but sometimes we have to make tough calls based on:

  • Project standards — Ensuring consistency across the codebase
  • User needs — Making sure changes align with what our users need
  • Code best practices — Maintaining code quality and maintainability
  • Project direction — Keeping aligned with our roadmap and vision

Your contribution is still valuable to us, and we'd love to see you contribute again in the future! If you'd like feedback on how to improve this PR or want to discuss alternative approaches, please don't hesitate to reach out.

Thank you for being part of the Omi community! 💜

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Use NER (Named Entity Recognition) or better techniques (like self-hosted LLM) to improve speaker detection based on transcripts ($500)

2 participants