Pod ml9h7 killed by liveness probe (10:01 UTC 2026-03-26) while serving 22 WS connections. Root cause: `_create_conversation_fallback` in `transcribe.py:715-737` runs `process_conversation()` synchronously on the backend-listen event loop when pusher is degraded.
## Current Behavior
- When pusher is degraded, conversation processing falls back to `_create_conversation_fallback`
- `process_conversation()` (line 729) is sync-blocking: OpenAI LLM calls, embeddings, Firestore writes, integration triggers
- This blocks the event loop for 30+ seconds
- Health endpoint cannot respond within 5s liveness timeout
- Pod killed after 5 consecutive failures, disconnecting all active WS sessions
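The failure mode above can be reproduced in miniature: a synchronous call inside a coroutine stalls every other task on the same event loop, including the health handler. This is a minimal sketch, not the real code — `health` and `blocking_fallback` are hypothetical stand-ins for the liveness handler and the sync `process_conversation()` call.

```python
import asyncio
import time

async def health() -> str:
    # Stand-in for the /health liveness handler: should return almost instantly.
    return "ok"

async def blocking_fallback() -> None:
    # Stand-in for the sync process_conversation() call: time.sleep() never
    # yields, so the entire event loop stalls for the duration.
    time.sleep(0.2)

async def main() -> float:
    start = time.monotonic()
    task = asyncio.create_task(blocking_fallback())
    await asyncio.sleep(0)  # hand control to the fallback task, which blocks the loop
    await health()          # cannot complete until the blocking call returns
    elapsed = time.monotonic() - start
    await task
    return elapsed

if __name__ == "__main__":
    print(f"health responded after {asyncio.run(main()):.2f}s")
```

With a real 30+ second blocking call, the same mechanism pushes the health response past the 5s liveness timeout.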
## Expected Behavior
- Fallback conversation processing must not block the backend-listen event loop
- Health endpoint must always remain responsive
- Pod stability must be maintained regardless of pusher state
## Affected Areas
| File | Line | Description |
|------|------|-------------|
| routers/transcribe.py | 715-737 | `_create_conversation_fallback()` runs `process_conversation()` directly |
| routers/transcribe.py | 729 | The blocking `process_conversation()` call |
| routers/transcribe.py | 730 | Blocking `trigger_external_integrations()` call |
| routers/transcribe.py | 841-861 | `_process_conversation()` routes to fallback when pusher degraded |
| routers/transcribe.py | 739-754 | `cleanup_processing_conversations()` also routes to fallback |
## Solution
Wrap the blocking calls in `asyncio.to_thread()` and add a per-session `asyncio.Semaphore(1)` to limit concurrent fallback processing, preventing a cascade when multiple sessions degrade simultaneously.
## Impact
P0 — pod kill disconnects all active users on the pod, not just the session that triggered fallback.
by AI for @beastoin