A practical guide for adding an LLM-powered QA evaluation agent to any project with a chat-based agent system. Based on the implementation in SaaStoAgent.
- Architecture Overview
- Prerequisites
- Part 1: Backend — LLM Evaluation Service
- Part 2: Frontend — Orchestration Hook
- Part 3: Frontend — QA Panel UI
- Part 4: Approval Handling
- Adaptation Checklist
- Common Pitfalls
┌──────────────────────────────────────────────────────────────┐
│                           Frontend                           │
│                                                              │
│  ┌──────────────────┐         ┌──────────────────────────┐   │
│  │ Main Chat        │         │ QA Agent Panel           │   │
│  │ (existing UI)    │◄────────│ (form + eval cards)      │   │
│  │                  │  sends  │                          │   │
│  │ Messages are     │  msgs   │ Orchestrates the loop    │   │
│  │ VISIBLE here     │  via    │ via useQAAgent hook      │   │
│  │                  │  hook   │                          │   │
│  └────────┬─────────┘         └────────────┬─────────────┘   │
│           │ SSE stream                     │ JSON POST       │
│           ▼                                ▼                 │
│  ┌──────────────────┐         ┌──────────────────────────┐   │
│  │ Chat Backend     │         │ QA Eval Endpoint         │   │
│  │ (existing agent) │         │ (stateless LLM judge)    │   │
│  └──────────────────┘         └──────────────────────────┘   │
└──────────────────────────────────────────────────────────────┘
Key design principle: The QA Agent does NOT run its own chat loop. It drives the existing chat UI by calling the same sendMessage() function the user would. All agent responses appear in the main chat — the QA panel only shows evaluation metadata.
This keeps the QA Agent honest — it tests the actual user-facing experience, not a hidden backend path.
Your project needs:
- A chat interface with a hook/function that sends messages (e.g., `sendMessage(text, sessionId?)`)
- A streaming indicator (e.g., `isStreaming: boolean`) that signals when the agent is responding
- A messages array that updates reactively as messages arrive
- An LLM API key (OpenAI or equivalent) for the evaluation judge
- (Optional) An approval/confirmation system if your agent has write-action gates
The evaluation service is a stateless LLM judge. It receives the conversation so far plus evaluation criteria, and returns a verdict.
# services/qa/agent.py
import json

from langchain_openai import ChatOpenAI
from langchain_core.messages import SystemMessage, HumanMessage


class QAAgentService:
    def __init__(self, api_key: str, model: str = "gpt-5-mini"):
        self._llm = ChatOpenAI(
            model=model,
            temperature=0.1,
            api_key=api_key,  # MUST pass explicitly; don't rely on env vars
        )

    async def evaluate_turn(
        self,
        *,
        conversation: list[dict],  # [{role: "user"|"assistant"|"qa_agent", content: str}]
        query: str,  # Original user query that started the test
        context: str = "",  # Background info (invisible to the agent)
        pass_criteria: str = "The agent provides a helpful and accurate response.",
        turn_number: int = 1,
        max_turns: int = 3,
    ) -> dict:
        system_prompt = (
            "You are a QA agent evaluating a conversation between a user and an AI agent system.\n\n"
            "You are given:\n"
            "- The original user query\n"
            "- Context about what the user is trying to do (use this to answer follow-ups)\n"
            "- Pass criteria defining what success looks like\n"
            "- The conversation so far\n\n"
            "After reviewing the latest agent response, decide:\n"
            '- "pass": The agent satisfied the pass criteria.\n'
            '- "fail": The agent clearly cannot satisfy the criteria.\n'
            '- "continue": More interaction needed. Generate a follow_up message.\n\n'
            "Respond with JSON:\n"
            '{ "verdict": "pass"|"fail"|"continue", "confidence": 0.0-1.0, '
            '"reasoning": "brief explanation", "follow_up": "message or null" }\n'
        )

        # Gentle nudge past limits, but don't force a verdict
        if turn_number > max_turns:
            system_prompt += (
                f"\nNOTE: Turn {turn_number}, past the expected limit of {max_turns}. "
                "Consider wrapping up, but only pass/fail if criteria warrant it.\n"
            )

        # Format conversation: QA follow-ups are shown to the judge as user turns
        conv_text = ""
        for entry in conversation:
            label = "User" if entry["role"] in ("user", "qa_agent") else "Agent"
            conv_text += f"**{label}:** {entry['content'][:2000]}\n\n"

        user_prompt = (
            f"## Original Query\n{query}\n\n"
            f"## Context\n{context or '(none)'}\n\n"
            f"## Pass Criteria\n{pass_criteria}\n\n"
            f"## Conversation (turn {turn_number}/{max_turns})\n{conv_text}"
        )

        resp = await self._llm.ainvoke([
            SystemMessage(content=system_prompt),
            HumanMessage(content=user_prompt),
        ])

        # Parse: handle markdown fences from some models
        text = resp.content.strip()
        if text.startswith("```"):
            text = text.split("\n", 1)[1] if "\n" in text else text[3:]
            text = text.rsplit("```", 1)[0]
        result = json.loads(text.strip())

        return {
            "verdict": result.get("verdict", "fail"),
            "confidence": float(result.get("confidence", 0.5)),
            "reasoning": result.get("reasoning", ""),
            "follow_up": result.get("follow_up"),
        }

Key decisions:
- `temperature=0.1`: You want consistent, not creative, judgments.
- The `context` field is only for the judge; it's never sent to the actual agent. This lets the QA agent "know" what answers to give when the agent asks clarifying questions.
- The `follow_up` field is what gets sent as the next user message when the agent asks follow-up questions.
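
The parsing step at the end of `evaluate_turn` can be mirrored in TypeScript when the same defensive handling is wanted on a Node backend or in the client. The following is a minimal sketch, not the SaaStoAgent implementation: `parseJudgeReply`, `stripFences`, and `JudgeResult` are hypothetical names, and mapping unparseable replies to an `'error'` verdict is an assumption (the Python version above would raise instead).

```typescript
type Verdict = 'pass' | 'fail' | 'continue' | 'error'

interface JudgeResult {
  verdict: Verdict
  confidence: number
  reasoning: string
  followUp: string | null
}

// Built with repeat() so this snippet contains no literal triple backtick.
const FENCE = '`'.repeat(3)

// Remove a leading fence line (with optional language tag) and the trailing fence.
function stripFences(raw: string): string {
  let text = raw.trim()
  if (text.startsWith(FENCE)) {
    const newline = text.indexOf('\n')
    text = newline >= 0 ? text.slice(newline + 1) : text.slice(FENCE.length)
    const lastFence = text.lastIndexOf(FENCE)
    if (lastFence >= 0) text = text.slice(0, lastFence)
  }
  return text.trim()
}

// Parse JSON and accept only plain objects; anything else yields null.
function parseObject(text: string): Record<string, unknown> | null {
  try {
    const parsed: unknown = JSON.parse(text)
    return typeof parsed === 'object' && parsed !== null && !Array.isArray(parsed)
      ? (parsed as Record<string, unknown>)
      : null
  } catch {
    return null
  }
}

const KNOWN_VERDICTS: readonly Verdict[] = ['pass', 'fail', 'continue']

function parseJudgeReply(raw: string): JudgeResult {
  const data = parseObject(stripFences(raw))
  if (data === null) {
    // Assumption: surface parse failures as 'error' rather than throwing.
    return { verdict: 'error', confidence: 0, reasoning: 'Judge reply was not valid JSON.', followUp: null }
  }
  const confidence = Number(data.confidence ?? 0.5)
  return {
    // Same defaults as the Python service: unknown verdict -> 'fail', missing confidence -> 0.5.
    verdict: KNOWN_VERDICTS.find((v) => v === data.verdict) ?? 'fail',
    confidence: Number.isFinite(confidence) ? Math.min(1, Math.max(0, confidence)) : 0.5,
    reasoning: typeof data.reasoning === 'string' ? data.reasoning : '',
    followUp: typeof data.follow_up === 'string' ? data.follow_up : null,
  }
}
```

Clamping `confidence` to the 0 to 1 range is an extra guard that the original service does not apply; drop it if out-of-range values should be visible.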
Expose a single stateless JSON endpoint:
# routes/qa.py
from pydantic import BaseModel, Field
from fastapi import APIRouter

router = APIRouter(prefix="/api/qa")


class QAEvalRequest(BaseModel):
    system_id: str
    query: str
    context: str = ""
    pass_criteria: str = "The agent provides a helpful and accurate response."
    conversation: list[dict] = Field(default_factory=list)
    turn_number: int = Field(default=1, ge=1)
    max_turns: int = Field(default=3, ge=1, le=10)


@router.post("/evaluate-turn")
async def evaluate_qa_turn(body: QAEvalRequest):
    from services.qa.agent import QAAgentService

    service = QAAgentService(api_key="your-key")  # Use your config system
    return await service.evaluate_turn(
        conversation=body.conversation,
        query=body.query,
        context=body.context,
        pass_criteria=body.pass_criteria,
        turn_number=body.turn_number,
        max_turns=body.max_turns,
    )

Response shape:
{
  "verdict": "pass" | "fail" | "continue",
  "confidence": 0.85,
  "reasoning": "The agent correctly listed all repositories.",
  "follow_up": null | "Yes, please show me the details for repo X"
}

export interface QAEvaluation {
  turn: number
  role: 'user' | 'qa_agent'
  verdict: 'pass' | 'fail' | 'continue' | 'error'
  confidence: number
  reasoning: string
  followUp?: string | null
  warning?: string
}

export interface QASummary {
  verdict: string
  confidence: number
  reasoning: string
  totalTurns: number
  elapsedSeconds: number
}

export type ApprovalMode = 'manual' | 'auto-approve' | 'auto-deny'

export interface QAAgentParams {
  query: string
  context: string
  passCriteria: string
  maxTurns: number
  maxTimeSeconds: number
  approvalMode: ApprovalMode
}

The hook is the core orchestrator. It plugs into your existing chat system:
interface UseQAAgentOptions {
  systemId: string
  sendMessage: (text: string, sessionId?: string) => void  // Your existing chat send
  isStreaming: boolean                                     // Your existing streaming flag
  messages: YourMessageType[]                              // Your existing messages array
  activeSessionId: string | null                           // Chat session ID
}

interface UseQAAgentReturn {
  evaluations: QAEvaluation[]
  summary: QASummary | null
  isRunning: boolean
  currentPhase: 'idle' | 'waiting' | 'evaluating' | 'done'
  error: string | null
  approvalNeeded: boolean
  runAgent: (params: QAAgentParams) => void
  abort: () => void
  reset: () => void
}

The loop works like this:
runAgent(params)
  → sendMessage(query)               // Message appears in main chat
  → phase = 'waiting'
  → [watch isStreaming: true → false]
  → phase = 'evaluating'
  → POST /api/qa/evaluate-turn       // Ask LLM judge
  → if verdict === 'continue':
      → sendMessage(follow_up)       // Follow-up appears in main chat
      → back to 'waiting'
  → else:
      → phase = 'done'               // Show summary
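
The two decision points in this loop (what to POST, and what to do with the verdict) can be kept as pure functions so they are testable outside React. The sketch below is illustrative only: `buildEvalRequest`, `decideNextStep`, `ChatMessage`, and `RunParams` are hypothetical names, the message shape is an assumption about your chat system, and ending the run when a `continue` verdict arrives without a follow-up is one possible policy rather than the documented one. The snake_case request fields match the `QAEvalRequest` model shown earlier.

```typescript
// Assumed minimal message shape; adapt to your own chat message type.
interface ChatMessage { role: 'user' | 'assistant'; content: string }

// Subset of QAAgentParams needed to build a request.
interface RunParams { query: string; context: string; passCriteria: string; maxTurns: number }

interface EvalRequest {
  system_id: string
  query: string
  context: string
  pass_criteria: string
  conversation: ChatMessage[]
  turn_number: number
  max_turns: number
}

interface EvalResponse {
  verdict: 'pass' | 'fail' | 'continue'
  confidence: number
  reasoning: string
  follow_up: string | null
}

type NextStep =
  | { phase: 'waiting'; send: string }
  | { phase: 'done'; verdict: 'pass' | 'fail'; reasoning: string }

// Map hook-side camelCase params to the endpoint's snake_case body.
function buildEvalRequest(
  systemId: string,
  params: RunParams,
  messages: readonly ChatMessage[],
  turn: number,
): EvalRequest {
  return {
    system_id: systemId,
    query: params.query,
    context: params.context,
    pass_criteria: params.passCriteria,
    conversation: messages.map((m) => ({ role: m.role, content: m.content })),
    turn_number: turn,
    max_turns: params.maxTurns,
  }
}

// Translate the judge's verdict into the loop's next action.
function decideNextStep(res: EvalResponse): NextStep {
  if (res.verdict === 'continue') {
    const followUp = res.follow_up?.trim()
    if (followUp) return { phase: 'waiting', send: followUp }
    // Assumption: 'continue' with nothing to send would stall the loop, so end the run.
    return { phase: 'done', verdict: 'fail', reasoning: 'Judge asked to continue but gave no follow-up.' }
  }
  return { phase: 'done', verdict: res.verdict, reasoning: res.reasoning }
}
```

Inside the hook, the `send` value would be passed to `sendMessage` and the phase written to both the state and the ref.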
Stale closures are the #1 bug source. React hooks capture values at render time, but the QA loop runs across many renders. You MUST use refs for anything read in async callbacks:
// Keep latest values in refs; update on every render
const messagesRef = useRef(messages)
messagesRef.current = messages

const sendMessageRef = useRef(sendMessage)
sendMessageRef.current = sendMessage

const activeSessionIdRef = useRef(activeSessionId)
activeSessionIdRef.current = activeSessionId

Then in your async handlers, always read from `messagesRef.current`, never from `messages`.
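
The difference between a captured value and a ref can be reproduced without React. In this small sketch, `Ref`, `makeStaleCounter`, `makeFreshCounter`, and `simulateTwoRenders` are hypothetical names used only for illustration; the "renders" are simulated by replacing the array the way an immutable state update would.

```typescript
type Ref<T> = { current: T }

// Captures the array that existed when the callback was created.
function makeStaleCounter(messages: readonly string[]): () => number {
  return () => messages.length
}

// Reads through the ref, so it sees whatever was assigned most recently.
function makeFreshCounter(ref: Ref<readonly string[]>): () => number {
  return () => ref.current.length
}

// Simulate two renders: each render hands the hook a new messages array.
function simulateTwoRenders(): { stale: number; fresh: number } {
  const firstRender: readonly string[] = ['user: list repos']
  const messagesRef: Ref<readonly string[]> = { current: firstRender }

  // Both callbacks are created during the first render.
  const stale = makeStaleCounter(firstRender)
  const fresh = makeFreshCounter(messagesRef)

  // Second render: a new array arrives and the ref is updated.
  const secondRender: readonly string[] = [...firstRender, 'agent: here are your repos']
  messagesRef.current = secondRender

  return { stale: stale(), fresh: fresh() }
}
```

Calling `simulateTwoRenders()` returns `{ stale: 1, fresh: 2 }`: the callback that closed over the first array never sees the agent's reply, which is the same failure mode as a QA loop that evaluates an outdated conversation.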
Watch isStreaming to know when the agent finished responding:
const wasStreamingRef = useRef(false)

useEffect(() => {
  const wasStreaming = wasStreamingRef.current
  wasStreamingRef.current = isStreaming

  // Only react to the true → false edge while the QA loop is waiting
  if (wasStreaming && !isStreaming && phaseRef.current === 'waiting') {
    handleResponseComplete()
  }
}, [isStreaming])

This fires once on the true → false transition, NOT on every re-render.
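
The edge detection itself is independent of React and can be checked in isolation. A minimal sketch, where `createStreamEndDetector` is a hypothetical helper that plays the role of `wasStreamingRef`:

```typescript
// Returns a function that reports true only when streaming goes from true to false.
function createStreamEndDetector(): (isStreaming: boolean) => boolean {
  let wasStreaming = false // same role as wasStreamingRef.current
  return (isStreaming: boolean): boolean => {
    const ended = wasStreaming && !isStreaming
    wasStreaming = isStreaming
    return ended
  }
}

// Feed a sequence of isStreaming values and collect the detector's output.
function detectStreamEnds(sequence: readonly boolean[]): boolean[] {
  const detect = createStreamEndDetector()
  return sequence.map((value) => detect(value))
}
```

For the input `[false, true, true, false, false]` the result is `[false, false, false, true, false]`: repeated `false` values after the response ends do not trigger a second evaluation.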
Use both React state (for UI rendering) and a ref (for async logic) to track the current phase:
const [currentPhase, setCurrentPhase] = useState<Phase>('idle')
const phaseRef = useRef<Phase>('idle')

// Always update both
phaseRef.current = 'waiting'
setCurrentPhase('waiting')

The ref prevents race conditions where useEffect fires before state updates propagate.
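
Because "always update both" is easy to forget at one call site, it may help to route every phase change through a single helper. The sketch below is framework-free so it can be tested directly; `createPhaseTracker` is a hypothetical name, and in a real hook the `setState` argument would be the `setCurrentPhase` setter and the ref would come from `useRef`.

```typescript
type Phase = 'idle' | 'waiting' | 'evaluating' | 'done'

interface PhaseTracker {
  ref: { current: Phase }
  set: (next: Phase) => void
}

// Couples the synchronous ref write with the (asynchronous) state update.
function createPhaseTracker(setState: (phase: Phase) => void): PhaseTracker {
  const ref: { current: Phase } = { current: 'idle' }
  return {
    ref,
    set(next: Phase): void {
      ref.current = next // visible to effects and async handlers immediately
      setState(next)     // schedules the re-render that updates the UI
    },
  }
}
```

With this in place, async code reads `tracker.ref.current` while the UI renders from state, and the two cannot drift apart because there is only one write path.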
Don't hard-stop at maxTurns — it produces poor evaluations. Instead:
turn > maxTurns → soft warning (amber banner, agent continues)
turn > maxTurns × 2 → hard stop (force fail verdict)
time > maxTime → soft warning (agent continues)
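
These three rules translate into a small pure function. The following is a sketch under the thresholds listed above; `checkLimits` and `LimitCheck` are hypothetical names, and the warning texts are placeholders.

```typescript
interface LimitCheck {
  hardStop: boolean   // true only when turn > maxTurns * 2
  warnings: string[]  // soft warnings to show in the panel; the run continues
}

function checkLimits(
  turn: number,
  elapsedSeconds: number,
  maxTurns: number,
  maxTimeSeconds: number,
): LimitCheck {
  const warnings: string[] = []
  if (turn > maxTurns) {
    warnings.push('Turn ' + turn + ' is past the expected limit of ' + maxTurns + '.')
  }
  if (elapsedSeconds > maxTimeSeconds) {
    warnings.push('Elapsed time is past the limit of ' + maxTimeSeconds + ' seconds.')
  }
  return { hardStop: turn > maxTurns * 2, warnings }
}
```

With `maxTurns = 3`, turns 4 through 6 produce a warning but keep running, and turn 7 is the first to return `hardStop: true`, at which point the hook would record a forced fail verdict.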
The QA panel has two modes:
Form mode (when idle, no results):
- Query textarea — "The message to send to the agent system"
- Context textarea — "Background info for follow-ups (not sent to agent)"
- Pass Criteria textarea — "What success looks like"
- Collapsible settings: Max Turns, Time Limit, Write Approvals mode
- "Run QA Test" button
Results mode (when running or has results):
- Test config summary (collapsed)
- EvalCard per turn — shows verdict badge, reasoning, follow-up sent, warnings
- Phase indicator with spinner ("Agent is responding...", "Evaluating response...")
- Approval needed banner (pulsing orange, when manual approval mode)
- SummaryCard at end — PASSED/FAILED with confidence, total turns, elapsed time
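
The switch between the two modes can be expressed as one predicate. This is a minimal sketch: `PanelState` and `panelMode` are hypothetical names, and treating a non-null `error` as a result is an assumption made so that a failed run stays visible instead of dropping back to the form.

```typescript
interface PanelState {
  isRunning: boolean
  evaluationCount: number  // evaluations.length from the hook
  hasSummary: boolean      // summary !== null
  error: string | null
}

// 'form' only when the hook is idle and there is nothing to show.
function panelMode(state: PanelState): 'form' | 'results' {
  const hasResults =
    state.isRunning ||
    state.evaluationCount > 0 ||
    state.hasSummary ||
    state.error !== null // keep errors on screen in results mode
  return hasResults ? 'results' : 'form'
}
```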
Mount the QA panel alongside your existing chat page:
// In your chat page component
const [showQAPanel, setShowQAPanel] = useState(false)

return (
  <div className="flex h-full">
    {/* Main chat: takes remaining space */}
    <div className={showQAPanel ? 'flex-1' : 'w-full'}>
      <YourExistingChat ... />
    </div>

    {/* QA panel: fixed width on the right */}
    {showQAPanel && (
      <div className="w-[420px] flex-shrink-0">
        <QAAgentPanel
          systemId={selectedSystem}
          sendMessage={sendMessage}     // From your chat hook
          isStreaming={isStreaming}     // From your chat hook
          messages={messages}           // From your chat hook
          activeSessionId={sessionId}   // From your chat state
        />
      </div>
    )}
  </div>
)

Add a toggle button in your chat toolbar:
<Button
  variant={showQAPanel ? 'secondary' : 'ghost'}
  onClick={() => setShowQAPanel(v => !v)}
>
  <FlaskConical className="w-4 h-4" />
  QA Agent
</Button>

If your agent system has approval gates (e.g., for write operations), the QA Agent needs to handle them. Without this, the agent will block waiting for human input and the QA run will hang.
Watch the messages array for approval payloads:
useEffect(() => {
  if (phaseRef.current !== 'waiting' || !paramsRef.current) return

  const pendingMsg = messages.find(m => m.pendingApproval)
  if (!pendingMsg?.pendingApproval) {
    setApprovalNeeded(false)
    return
  }

  const approvalId = pendingMsg.pendingApproval.approvalId
  if (handledApprovalIdsRef.current.has(approvalId)) return

  if (paramsRef.current.approvalMode === 'manual') {
    setApprovalNeeded(true)  // Show banner
    return
  }

  // Auto-resolve
  handledApprovalIdsRef.current.add(approvalId)
  const decision = paramsRef.current.approvalMode === 'auto-approve' ? 'approve' : 'deny'

  // Call your approval API; read the session ID from its ref to avoid a stale value
  submitApproval(activeSessionIdRef.current, approvalId, pendingMsg.pendingApproval.tools, decision)
}, [messages])

| Mode | Behavior |
|---|---|
| `manual` | Shows a pulsing orange banner: "Approval required — handle it in the main chat" |
| `auto-approve` | Calls approval API immediately with all tools approved |
| `auto-deny` | Calls approval API immediately with all tools denied |
Track handled approval IDs in a ref (Set<string>) to prevent duplicate submissions when the messages array re-renders.
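
The per-approval decision, including the duplicate guard, can be isolated from the effect and tested on its own. In this sketch `resolveApproval` and `ApprovalAction` are hypothetical names; the `handled` set stands in for `handledApprovalIdsRef.current`.

```typescript
type ApprovalMode = 'manual' | 'auto-approve' | 'auto-deny'
type ApprovalAction = 'approve' | 'deny' | 'show-banner' | 'skip'

// Decide what to do with a pending approval; mutates `handled` only when auto-resolving.
function resolveApproval(
  mode: ApprovalMode,
  approvalId: string,
  handled: Set<string>,
): ApprovalAction {
  if (handled.has(approvalId)) return 'skip'    // already submitted on an earlier pass
  if (mode === 'manual') return 'show-banner'   // the human resolves it in the main chat
  handled.add(approvalId)                       // record before calling the approval API
  return mode === 'auto-approve' ? 'approve' : 'deny'
}
```

If the effect runs twice for the same pending message, the first call returns `'approve'` (or `'deny'`) and the second returns `'skip'`, so the approval API is called once.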
When porting to a new project:
- Backend: Create evaluation service; adapt the system prompt for your domain if needed
- Backend: Add `/evaluate-turn` route; adapt auth/middleware to your framework
- Backend: Wire LLM API key from your config system (don't hardcode, don't rely on env vars)
- Frontend hook: Replace `api.post()` with your project's API helper
- Frontend hook: Replace `ChatUIMessage` type with your message type
- Frontend hook: Replace `storage.getToken()` / `storage.getWorkspaceId()` with your auth helpers
- Frontend hook: Adapt `pendingApproval` detection to your approval system's shape (or remove if no approvals)
- Frontend panel: Adapt UI components (Button, ScrollArea, etc.) to your component library
- Frontend chat page: Pass `sendMessage`, `isStreaming`, `messages`, `sessionId` from your chat hook
- Frontend chat page: Add toggle button and split-view layout
Symptom: Hook reads old messages, sends wrong follow-ups, or silently fails after the first turn.
Fix: Use refs for ALL values read in async callbacks. Update refs on every render (ref.current = value).
Symptom: 500 error from eval endpoint — "api_key client option must be set".
Fix: Pass api_key explicitly in the LLM constructor. Don't rely on OPENAI_API_KEY env var.
Symptom: QA run hangs at "Agent is responding..." after the agent triggers a write action.
Fix: Implement approval detection (Part 4). Without it, the backend blocks forever on the approval gate.
Symptom: After an error, the panel resets to the form view instead of showing the error.
Fix: Include error !== null in your hasResults check so the panel stays in results mode.
Symptom: Agent is making progress but gets force-failed at maxTurns.
Fix: Use soft limits with warnings. Only hard-stop at 2× maxTurns.
Symptom: Compile error about "JSX expressions must have one parent element" in settings section.
Fix: Wrap sibling JSX blocks in a <div> or <> fragment when conditionally rendered.