Enhancement: Unify Primary/Fallback LLM Failover Policy Across Execution Paths

## Summary

The platform defines both `primary_model_id` and `fallback_model_id`, but failover behavior is inconsistent across the chat, channel, and background paths. Some paths only apply config-level fallback (the fallback is used when no primary model is loaded), some attempt runtime fallback, and some ignore the fallback entirely.
This issue proposes one shared failover policy and one shared executor so that behavior is consistent and observable.

## Current Behavior

1) Model schema supports primary + fallback
```py
primary_model_id: uuid.UUID | None = None
fallback_model_id: uuid.UUID | None = None
```

2) Web chat does primary-first + config-level fallback

```py
if agent.primary_model_id:
    ...  # load primary
if agent.fallback_model_id:
    ...  # load fallback
# Fallback is used only when no primary model was loaded
if not llm_model and fallback_llm_model:
    llm_model = fallback_llm_model
```

3) Runtime fallback exists, but the lower layer often returns error text instead of raising
```py
# call_llm catches the exception and returns a string
except LLMError as e:
    return f"[LLM Error] {e}"
except Exception as e:
    return f"[LLM call error] ..."
```
Because the failure comes back as an ordinary string, no exception reaches the caller, so the outer `except`-driven fallback logic is not triggered for these errors.
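The effect can be reproduced in isolation. The following is a minimal sketch, not the project's code: `LLMError` is redefined here as a stand-in, and `call_llm_returning_text` and `call_with_fallback` are hypothetical names used only for illustration.

```python
class LLMError(Exception):
    """Stand-in for the project's LLMError (assumed to be an Exception subclass)."""


def call_llm_returning_text(fail: bool) -> str:
    # Mirrors the pattern above: the error is caught and returned as text.
    try:
        if fail:
            raise LLMError("provider timeout")
        return "ok"
    except LLMError as e:
        return f"[LLM Error] {e}"


def call_with_fallback(call, fallback_reply: str) -> str:
    # Outer except-driven fallback: it only runs if an exception propagates.
    try:
        return call()
    except LLMError:
        return fallback_reply


result = call_with_fallback(
    lambda: call_llm_returning_text(fail=True), "fallback reply"
)
print(result)  # prints the error text; the fallback branch never runs
```

The caller receives the error string as if it were a normal reply, which is why the shared executor would need the lower layer to raise (or return a structured error) rather than return text.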

4) Slack/Teams/Discord/WeCom/DingTalk reuse Feishu's `_call_agent_llm`

These paths share the same failover characteristics as Feishu.

5) Background services mostly do one-shot selection (primary or fallback)
```py
model_id = agent.primary_model_id or agent.fallback_model_id
```
There is no runtime failover: if the selected model fails, the other model is not tried.
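A small sketch of what this one-shot expression does. The `Agent` dataclass and `select_model_id` below are hypothetical stand-ins, not the project's classes.

```python
import uuid
from dataclasses import dataclass
from typing import Optional


@dataclass
class Agent:
    # Hypothetical stand-in carrying only the two fields from the schema.
    primary_model_id: Optional[uuid.UUID] = None
    fallback_model_id: Optional[uuid.UUID] = None


def select_model_id(agent: Agent) -> Optional[uuid.UUID]:
    # One-shot selection: the fallback is chosen only when no primary is set.
    return agent.primary_model_id or agent.fallback_model_id


primary, fallback = uuid.uuid4(), uuid.uuid4()
both = select_model_id(Agent(primary, fallback))
only_fallback = select_model_id(Agent(None, fallback))
```

With both IDs set, the fallback is never selected, so a runtime failure of the primary has nowhere to go.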

6) Some paths are primary-only
Files: `backend/app/services/trigger_daemon.py`, `backend/app/api/gateway.py`, `backend/app/services/agent_manager.py`

Primary is used directly; fallback is not part of the path.


## Proposed Solution

A) Add one shared failover executor
Create `llm_failover` and route all LLM entrypoints through it.

For example:
```py
async def invoke_with_failover(primary_model, fallback_model, invoke_once, context):
    ...
```

B) Unified switching rules
1. Try primary if available.
2. If primary missing/unavailable, use fallback directly.
3. If primary fails with retryable error, retry once on fallback.
4. If error is non-retryable (auth/validation/schema), do not switch.
5. Max attempts per request: 2 (primary + fallback).
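The rules above could be sketched as follows. This is one possible shape under stated assumptions, not a definitive implementation: `RetryableLLMError` and `NonRetryableLLMError` are hypothetical exception types, `invoke_once` is assumed to be an async callable taking `(model, context)` and raising on failure, and "missing/unavailable" is modelled simply as `None`.

```python
import asyncio
from typing import Any, Awaitable, Callable, Optional


class RetryableLLMError(Exception):
    """Hypothetical: a transient failure such as a timeout, 429, or 5xx."""


class NonRetryableLLMError(Exception):
    """Hypothetical: an auth, validation, or schema failure."""


async def invoke_with_failover(
    primary_model: Optional[Any],
    fallback_model: Optional[Any],
    invoke_once: Callable[[Any, Any], Awaitable[str]],
    context: Any,
) -> str:
    # Rule 2: primary missing, so use the fallback directly.
    if primary_model is None:
        if fallback_model is None:
            # Assumption: no model at all is treated as non-retryable.
            raise NonRetryableLLMError("no primary or fallback model configured")
        return await invoke_once(fallback_model, context)
    try:
        # Rule 1: try the primary first.
        return await invoke_once(primary_model, context)
    except RetryableLLMError:
        # Rule 3: retry once on the fallback. Rule 5: never more than 2 attempts.
        if fallback_model is None:
            raise
        return await invoke_once(fallback_model, context)
    # Rule 4: any other exception propagates without switching models.


async def _demo():
    calls = []

    async def invoke_once(model, context):
        calls.append(model)
        if model == "primary":
            raise RetryableLLMError("429 from provider")
        return f"reply from {model}"

    reply = await invoke_with_failover("primary", "fallback", invoke_once, {})
    return reply, calls


demo_reply, demo_calls = asyncio.run(_demo())
```

Passing `invoke_once` as a callable keeps the executor independent of any particular entrypoint, so the chat, channel, and background paths could each supply their own call while sharing the switching rules.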

C) Retryable error scope (To be discussed)
- Network timeout / connection errors
- Provider 429
- Provider 5xx
- Explicit transient provider errors
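As a starting point for that discussion, the scope above could be expressed as a single predicate. This is a sketch under assumptions: `is_retryable` is a hypothetical helper, and the caller is assumed to extract an HTTP status code from the provider error where one exists. Provider-specific transient errors are left out because they depend on each provider's SDK.

```python
import asyncio
from typing import Optional


def is_retryable(exc: BaseException, status_code: Optional[int] = None) -> bool:
    # Network timeout / connection errors (built-in exception types only).
    if isinstance(exc, (TimeoutError, ConnectionError, asyncio.TimeoutError)):
        return True
    # Provider 429 (rate limited).
    if status_code == 429:
        return True
    # Provider 5xx (server-side failure).
    if status_code is not None and 500 <= status_code <= 599:
        return True
    # Everything else, including auth/validation/schema errors, does not switch.
    return False
```

Keeping the classification in one function means a change to the retryable scope is made in one place rather than in every execution path.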




Enhancement: Unify Primary/Fallback LLM Failover Policy Across Execution Paths #154
