feat(backend): NER-based speaker detection using spaCy#5918
sungdark wants to merge 2 commits into BasedHardware:main from
Conversation
Implements NER-based PERSON entity extraction as a fallback for regex-based speaker identification (issue BasedHardware#3039).

Changes:
- Add backend/utils/ner_speaker_detection.py with spaCy NER integration
- Modify detect_speaker_from_text() to use 2-pass detection:
  1. Regex patterns (fast, high-precision for self-introductions)
  2. NER fallback (catches names in third-person contexts)
- Add detect_all_speakers_from_text() for multi-speaker detection
- Add spaCy to backend/requirements.txt
- Add unit tests for the NER detection module

The NER approach uses spaCy's pre-trained en_core_web_sm model, which can be self-hosted with no external API calls, supporting the project's requirement for fast, low-cost, open-license NER.

Installation: pip install spacy && python -m spacy download en_core_web_sm
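The 2-pass strategy described above can be sketched as follows. The function name detect_speaker_from_text comes from the PR; the specific regex patterns and the ner_fallback parameter are illustrative assumptions, not the PR's exact code:

```python
import re

# Illustrative self-introduction patterns for pass 1; the PR's actual
# regexes may differ.
_INTRO_PATTERNS = [
    re.compile(r"\bmy name is (\w+)", re.IGNORECASE),
    re.compile(r"\bi'?m (\w+)", re.IGNORECASE),
    re.compile(r"\bthis is (\w+)", re.IGNORECASE),
]


def detect_speaker_from_text(text: str, ner_fallback=None):
    """Two-pass speaker detection: fast regex first, NER fallback second."""
    # Pass 1: regex (fast, high precision for self-introductions)
    for pattern in _INTRO_PATTERNS:
        m = pattern.search(text)
        if m:
            return m.group(1).capitalize()
    # Pass 2: NER fallback (catches names in third-person contexts);
    # skipped entirely when NER is unavailable.
    if ner_fallback is not None:
        names = ner_fallback(text)
        if names:
            return names[0]
    return None
```

The key design point is ordering: the cheap high-precision pass runs on every utterance, and the slower model-based pass only runs when the regexes find nothing.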
Greptile Summary

This PR adds NER-based speaker detection to the backend using spaCy as a fallback when regex patterns don't match self-introduction phrases. The overall approach is sound — a two-pass strategy (fast regex → NER fallback) fits well within the existing pipeline.

Key issues:
Confidence Score: 2/5
Important Files Changed
Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[detect_speaker_from_text / detect_all_speakers_from_text] --> B{Pass 1: Regex patterns}
B -- match found --> C[Return name.capitalize]
B -- no match --> D{NER available?}
D -- No --> E[Return None / empty list]
D -- Yes --> F[detect_persons_with_ner]
F --> G[_load_model - cached spaCy model]
G -- load error --> E
G -- success --> H[nlp text - run NER pipeline]
H --> I{For each PERSON entity}
I --> J[_get_context - window of tokens]
J --> K{_is_valid_person_name?}
K -- invalid --> I
K -- valid --> L[_normalize_name - title case hyphens apostrophes]
L --> M{Seen before? Deduplicate}
M -- duplicate --> I
M -- new --> N[Add to found_names]
N --> O{max_persons reached?}
O -- no --> I
O -- yes --> P[Return found_names]
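The validation, normalization, and deduplication steps in the flowchart can be sketched without the spaCy dependency by operating on plain entity-text strings. The helper names _normalize_name and _is_valid_person_name mirror the flowchart; their bodies and the collect_person_names wrapper are assumptions for illustration:

```python
def _normalize_name(name: str) -> str:
    """Title-case a name while preserving hyphens and apostrophes,
    e.g. "mary-jane o'brien" -> "Mary-Jane O'Brien"."""
    def cap_part(part: str) -> str:
        return "-".join(p[:1].upper() + p[1:] for p in part.split("-") if p)

    return " ".join(
        "'".join(cap_part(s) for s in word.split("'"))
        for word in name.split()
    )


def _is_valid_person_name(name: str) -> bool:
    """Cheap filter: 1-4 alphabetic words (hyphens/apostrophes allowed)."""
    words = name.split()
    return 0 < len(words) <= 4 and all(
        w.replace("-", "").replace("'", "").isalpha() for w in words
    )


def collect_person_names(entity_texts, max_persons: int = 10):
    """Validate, normalize, and deduplicate candidate PERSON entities,
    stopping once max_persons names have been collected."""
    found_names, seen = [], set()
    for text in entity_texts:
        if not _is_valid_person_name(text):
            continue  # invalid -> next entity
        normalized = _normalize_name(text)
        key = normalized.lower()
        if key in seen:
            continue  # duplicate -> next entity
        seen.add(key)
        found_names.append(normalized)
        if len(found_names) >= max_persons:
            break
    return found_names
```

Deduplicating on the lowercased normalized form means "alice" and "ALICE" collapse to a single entry.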
Reviews (1): Last reviewed commit: "feat(backend): add NER-based speaker nam..."
| return True | ||
| def _get_context(doc: Doc, ent: spacy.tokensSpan, window: int = 10) -> tuple: |
Invalid type annotation causes AttributeError at import time
spacy.tokensSpan is not a valid attribute — the correct path is spacy.tokens.Span. Because Python evaluates function annotations at def-execution time (absent from __future__ import annotations), this triggers an AttributeError: module 'spacy' has no attribute 'tokensSpan' the moment ner_speaker_detection is imported. The try/except in _is_ner_available() catches the import failure, so the app won't crash, but NER will silently never work even when spaCy and the model are both properly installed.
Since Doc is already imported from spacy.tokens, the simplest fix is to import Span the same way:
| def _get_context(doc: Doc, ent: spacy.tokensSpan, window: int = 10) -> tuple: | |
| def _get_context(doc: Doc, ent: "spacy.tokens.Span", window: int = 10) -> tuple: |
Or add from spacy.tokens import Span at the top of the file and use Span directly:
from spacy.tokens import Doc, Span
def _get_context(doc: Doc, ent: Span, window: int = 10) -> tuple:

| from utils.ner_speaker_detection import is_ner_available | ||
| _ner_detection_available = is_ner_available() |
In-function imports violate project import rules
The project's backend import rules prohibit in-function imports. There are three occurrences in this file:
- speaker_identification.py:241 — from utils.ner_speaker_detection import is_ner_available (inside _is_ner_available)
- speaker_identification.py:277 — from utils.ner_speaker_detection import detect_persons_with_ner (inside detect_speaker_from_text)
- speaker_identification.py:321 — from utils.ner_speaker_detection import detect_persons_with_ner (inside detect_all_speakers_from_text)
The intent is to avoid a hard dependency on spaCy, but since spacy>=3.7.0 is now added to requirements.txt, it is already a hard dependency. The lazy-import workaround is therefore inconsistent with the declared requirements.
Move the import to the top of the file and remove the _is_ner_available caching wrapper, which becomes unnecessary:
try:
from utils.ner_speaker_detection import detect_persons_with_ner, is_ner_available as _ner_is_available
_ner_detection_available = _ner_is_available()
except Exception:
_ner_detection_available = False
detect_persons_with_ner = None  # type: ignore

Context Used: Backend Python import rules - no in-function impor... (source)
| SoundCard==0.4.3 | ||
| soundfile==0.12.1 | ||
| soxr==0.4.0 | ||
| spacy>=3.7.0 |
Unpinned version breaks reproducible builds
Every other entry in this file uses an exact pinned version (e.g., SoundCard==0.4.3, speechbrain==1.0.0). Using spacy>=3.7.0 allows any future major/minor release to be pulled in, which could introduce breaking changes silently.
| spacy>=3.7.0 | |
| spacy==3.7.4 |
Pick the exact version that has been validated with the rest of the dependency tree.
| # Check for common name introduction patterns - if "name" appears after | ||
| # words like "called" or "named", it's more likely to be a real name | ||
| context_lower = context_before.lower() | ||
| if any(p in context_lower for p in ["called ", "named ", "is ", "this is ", "i'm ", "i am "]): | ||
| return True |
Over-broad context pattern causes false positives
The substring "is " used as a short-circuit return True is far too wide. Almost any English sentence contains "is" (e.g., "The server is down", "This is unrelated", "There is no issue"). Any PERSON entity that spaCy detects in a sentence containing "is" will bypass all remaining validation and be accepted unconditionally.
Consider narrowing the patterns to the specific introduction phrases you actually intend to detect, and avoiding the bare "is " substring:
| # Check for common name introduction patterns - if "name" appears after | |
| # words like "called" or "named", it's more likely to be a real name | |
| context_lower = context_before.lower() | |
| if any(p in context_lower for p in ["called ", "named ", "is ", "this is ", "i'm ", "i am "]): | |
| return True | |
| if any(p in context_lower for p in ["called ", "named ", "this is ", "i'm ", "i am "]): | |
| return True |
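The difference between the bare substring test and a word-boundary version can be seen directly. Both functions below are illustrative stand-ins for the check in the diff, not the PR's code:

```python
import re


def accepts_broad(context_before: str) -> bool:
    """Original check: bare substring match; "is " fires on almost any
    English sentence."""
    c = context_before.lower()
    return any(p in c for p in ["called ", "named ", "is ", "this is ", "i'm ", "i am "])


def accepts_narrow(context_before: str) -> bool:
    """Narrowed check: drop the bare "is " and anchor the remaining
    phrases with word boundaries."""
    c = context_before.lower()
    return any(
        re.search(p, c)
        for p in [r"\bcalled\b", r"\bnamed\b", r"\bthis is\b", r"\bi'?m\b", r"\bi am\b"]
    )
```

A sentence like "The server is down" passes the broad check purely because it contains "is ", while the narrowed check correctly rejects it and still accepts genuine introductions.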
| # Additional context-based filtering | ||
| # Skip names that are clearly describing someone not present | ||
| # (e.g., "President Biden" -> just "Biden") | ||
| if "president" in context_before.lower() or "minister" in context_before.lower(): | ||
| # Extract just the name part | ||
| words = normalized.split() | ||
| if len(words) > 1 and words[0].lower() in ["president", "minister", "dr", "dr.", "mr", "mr.", "ms", "ms.", "mrs", "mrs."]: | ||
| normalized = " ".join(words[1:]) |
Title-stripping logic checks wrong text
The condition checks context_before for "president" / "minister", but the actual title stripping operates on words[0] of normalized (the entity text itself). These are two different spans.
In practice, spaCy's NER model usually does NOT include honorifics/titles as part of a PERSON entity span — it would extract "Biden" rather than "President Biden". So words[0] of a typical PERSON entity will rarely be "president". The guard condition on context_before would fire in sentences like "Former president Biden said...", but normalized would just be "Biden" (a single word), causing the len(words) > 1 guard to suppress the stripping silently.
The condition that actually makes this work as intended would be to inspect the entity text itself:
words = normalized.split()
if len(words) > 1 and words[0].lower() in {"president", "minister", "dr", "dr.", "mr", "mr.", "ms", "ms.", "mrs", "mrs."}:
normalized = " ".join(words[1:])

This would be unconditional (no need to check context_before), matching the cases where spaCy does include the title in the entity span.
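Packaged as a helper, the unconditional version the review proposes could look like this (the function name is hypothetical; the title set comes from the diff):

```python
_TITLES = {"president", "minister", "dr", "dr.", "mr", "mr.", "ms", "ms.", "mrs", "mrs."}


def strip_leading_title(normalized: str) -> str:
    """Strip a leading honorific from the entity text itself, without
    consulting the surrounding context window."""
    words = normalized.split()
    if len(words) > 1 and words[0].lower() in _TITLES:
        return " ".join(words[1:])
    return normalized
```

The len(words) > 1 guard ensures a bare single-word entity is never emptied, so "Biden" alone passes through untouched.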
…itive patterns

Critical fixes:
- Fix AttributeError: 'spacy.tokensSpan' is invalid; use 'Span' from spacy.tokens
- Tighten introduction pattern matching with word boundaries to reduce false positives (remove overly broad 'is ' substring match)
- Pin spacy and en-core-web-sm versions for reproducibility
- Move in-function NER imports to module level (per project import rules)

Fixes Greptile review issues:
- BasedHardware#5918 (comment)
AI PRs with low efforts are not welcome here. Thank you. — by CTO
Hey @sungdark 👋 Thank you so much for taking the time to contribute to Omi! We truly appreciate you putting in the effort to submit this pull request. After careful review, we've decided not to merge this particular PR. Please don't take this personally — we genuinely try to merge as many contributions as possible, but sometimes we have to make tough calls based on:
Your contribution is still valuable to us, and we'd love to see you contribute again in the future! If you'd like feedback on how to improve this PR or want to discuss alternative approaches, please don't hesitate to reach out. Thank you for being part of the Omi community! 💜
Summary
Implements NER-based speaker name detection as described in issue #3039. Uses spaCy's pre-trained NER model as a fallback when regex patterns don't match speaker self-introductions.
Changes
New: backend/utils/ner_speaker_detection.py
Modified: backend/utils/speaker_identification.py
backend/requirements.txt
New: backend/tests/unit/test_ner_speaker_detection.py
Why NER?
The current regex-based approach misses many speaker names that don't follow self-introduction patterns. NER can detect:
Installation
After merge, users running the backend should install:
pip install spacy
python -m spacy download en_core_web_sm
The code gracefully degrades if spaCy is not installed (NER pass is skipped).
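The graceful-degradation check can be done without importing spaCy eagerly, since `python -m spacy download en_core_web_sm` installs the model as an importable package. This is a sketch; the PR's actual is_ner_available may be implemented differently:

```python
import importlib.util


def is_ner_available() -> bool:
    """True only if both spaCy and the en_core_web_sm model package are
    installed; callers skip the NER pass otherwise."""
    return (
        importlib.util.find_spec("spacy") is not None
        and importlib.util.find_spec("en_core_web_sm") is not None
    )
```

Probing with find_spec avoids paying spaCy's import cost (and any import-time errors) just to discover that NER is unavailable.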
Bounty Reference
Closes #3039 (NER for speaker detection - $500 bounty)