Enhancement: Unify Primary/Fallback LLM Failover Policy Across Execution Paths #154

@Congregalis

Summary

The platform defines both primary_model_id and fallback_model_id, but failover behavior is inconsistent across chat/channel/background paths. Some paths only do config-level fallback, some attempt runtime fallback, and some ignore fallback entirely.
This issue proposes one shared failover policy and one shared executor to make behavior consistent and observable.

Current Behavior

  1. Model schema supports primary + fallback
     primary_model_id: uuid.UUID | None = None
     fallback_model_id: uuid.UUID | None = None
  2. Web chat does primary-first + config-level fallback
     if agent.primary_model_id:
         ...  # load primary
     if agent.fallback_model_id:
         ...  # load fallback
     if not llm_model and fallback_llm_model:
         llm_model = fallback_llm_model
  3. Runtime fallback exists, but the lower layer often returns error text instead of raising
     # call_llm catches and returns a string
     except LLMError as e:
         return f"[LLM Error] {e}"
     except Exception as e:
         return f"[LLM call error] ..."

This weakens outer except-driven fallback logic: the failure arrives as an ordinary return value, so an enclosing try/except never sees an exception to react to.
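
To make the effect concrete, here is a minimal sketch of the two styles side by side. The function names `call_llm_swallowing`, `call_llm_raising`, and `with_fallback` are hypothetical, and `LLMError` is a stand-in for the class named in the excerpt; none of this is the project's actual code.

```python
import asyncio


class LLMError(Exception):
    """Stand-in for the LLMError named in the excerpt above."""


async def call_llm_swallowing(model: str) -> str:
    # Mirrors the current pattern: the failure is turned into ordinary text.
    try:
        raise LLMError(f"{model} timed out")
    except LLMError as e:
        return f"[LLM Error] {e}"


async def call_llm_raising(model: str) -> str:
    # Alternative: let the failure propagate so callers can react to it.
    raise LLMError(f"{model} timed out")


async def with_fallback(call, primary: str, fallback: str) -> str:
    # Outer except-driven fallback, as a caller might write it.
    try:
        return await call(primary)
    except LLMError:
        return f"answer from {fallback}"


swallowed = asyncio.run(with_fallback(call_llm_swallowing, "primary", "fallback"))
raised = asyncio.run(with_fallback(call_llm_raising, "primary", "fallback"))
print(swallowed)  # the error text comes back as if it were a reply
print(raised)     # the fallback branch runs
```

With the swallowing variant the caller receives the error string and the fallback is never attempted; only the raising variant reaches the except branch.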

  4. Slack/Teams/Discord/WeCom/DingTalk reuse Feishu _call_agent_llm

These paths share the same failover characteristics as Feishu.

  5. Background services mostly do one-shot selection (primary or fallback)
     model_id = agent.primary_model_id or agent.fallback_model_id

There is no runtime failover retry if the selected model fails.

  6. Some paths are primary-only
    Files: backend/app/services/trigger_daemon.py, backend/app/api/gateway.py, backend/app/services/agent_manager.py

Primary is used directly; fallback is not part of the path.

Proposed Solution

A) Add one shared failover executor
Create llm_failover and route all LLM entrypoints through it.

e.g.

async def invoke_with_failover(primary_model, fallback_model, invoke_once, context):
    ...

B) Unified switching rules

  1. Try primary if available.
  2. If primary missing/unavailable, use fallback directly.
  3. If primary fails with retryable error, retry once on fallback.
  4. If error is non-retryable (auth/validation/schema), do not switch.
  5. Max attempts per request: 2 (primary + fallback).
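
The five rules above could be sketched roughly as follows. This is an illustration under assumptions, not a proposed final implementation: the exception classes `RetryableLLMError` and `NonRetryableLLMError` are hypothetical, `invoke_once` is assumed to be an async callable taking `(model, context)`, and "unavailable" is simplified to "the model is `None`".

```python
import asyncio
from typing import Any, Awaitable, Callable, Optional


class RetryableLLMError(Exception):
    """Hypothetical: timeouts, 429, 5xx, transient provider errors."""


class NonRetryableLLMError(Exception):
    """Hypothetical: auth, validation, or schema errors."""


async def invoke_with_failover(
    primary_model: Optional[Any],
    fallback_model: Optional[Any],
    invoke_once: Callable[[Any, dict], Awaitable[str]],
    context: dict,
) -> str:
    # Rule 2: primary missing, so use the fallback directly.
    if primary_model is None:
        if fallback_model is None:
            raise ValueError("no primary or fallback model configured")
        return await invoke_once(fallback_model, context)
    try:
        # Rule 1: try the primary first.
        return await invoke_once(primary_model, context)
    except RetryableLLMError:
        # Rule 3: one retry on the fallback; rule 5 caps attempts at two.
        if fallback_model is None:
            raise
        return await invoke_once(fallback_model, context)
    # Rule 4: NonRetryableLLMError is not caught above, so it propagates
    # to the caller without switching models.


# Usage with a fake invoker that fails on the primary only.
calls = []


async def fake_invoke(model: Any, context: dict) -> str:
    calls.append(model)
    if model == "primary":
        raise RetryableLLMError("provider returned 503")
    return f"reply from {model}"


result = asyncio.run(invoke_with_failover("primary", "fallback", fake_invoke, {}))
print(result, calls)
```

Raising typed exceptions from the lower layer is what lets the executor distinguish rule 3 from rule 4; a returned error string would bypass both.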

C) Retryable error scope (To be discussed)

  • Network timeout / connection errors
  • Provider 429
  • Provider 5xx
  • Explicit transient provider errors
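
As a starting point for that discussion, the scope above could be expressed as a single predicate. The classes `ProviderHTTPError` and `TransientProviderError` are hypothetical placeholders for whatever the provider clients actually raise; only the built-in timeout and connection exceptions are real Python types.

```python
import asyncio


class ProviderHTTPError(Exception):
    """Hypothetical exception carrying the provider's HTTP status code."""

    def __init__(self, status_code: int, message: str = "") -> None:
        super().__init__(message or f"provider returned {status_code}")
        self.status_code = status_code


class TransientProviderError(Exception):
    """Hypothetical marker for errors a provider explicitly labels transient."""


def is_retryable(exc: BaseException) -> bool:
    # Network timeouts and connection errors.
    if isinstance(exc, (asyncio.TimeoutError, TimeoutError, ConnectionError)):
        return True
    # Provider 429 and 5xx.
    if isinstance(exc, ProviderHTTPError):
        return exc.status_code == 429 or 500 <= exc.status_code <= 599
    # Explicit transient provider errors.
    return isinstance(exc, TransientProviderError)


print(is_retryable(ProviderHTTPError(429)))
print(is_retryable(ProviderHTTPError(401)))
```

Keeping the classification in one function means the retryable list can be changed in one place once the scope is agreed.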
