Building a QA Agent for Agentic Projects

A practical guide for adding an LLM-powered QA evaluation agent to any project with a chat-based agent system. Based on the implementation in SaaStoAgent.

Table of Contents

Architecture Overview
Prerequisites
Part 1: Backend — LLM Evaluation Service
- 1.1 Evaluation Service
- 1.2 API Route
Part 2: Frontend — Orchestration Hook
- 2.1 Types
- 2.2 The useQAAgent Hook
- 2.3 Critical Patterns
Part 3: Frontend — QA Panel UI
- 3.1 Panel Layout
- 3.2 Integration with Chat Page
Part 4: Approval Handling
Adaptation Checklist
Common Pitfalls

Architecture Overview

┌─────────────────────────────────────────────────────────┐
│ Frontend                                                │
│                                                         │
│  ┌─────────────────┐        ┌────────────────────────┐  │
│  │  Main Chat       │        │  QA Agent Panel        │  │
│  │  (existing UI)   │◄───────│  (form + eval cards)   │  │
│  │                  │  sends │                        │  │
│  │  Messages are    │  msgs  │  Orchestrates the loop │  │
│  │  VISIBLE here    │  via   │  via useQAAgent hook   │  │
│  │                  │  hook  │                        │  │
│  └────────┬─────────┘        └───────────┬────────────┘  │
│           │ SSE stream                   │ JSON POST     │
│           ▼                              ▼               │
│ ┌──────────────────┐        ┌──────────────────────────┐ │
│ │ Chat Backend      │        │ QA Eval Endpoint         │ │
│ │ (existing agent)  │        │ (stateless LLM judge)    │ │
│ └──────────────────┘        └──────────────────────────┘ │
└─────────────────────────────────────────────────────────┘

Key design principle: The QA Agent does NOT run its own chat loop. It drives the existing chat UI by calling the same sendMessage() function the user would. All agent responses appear in the main chat — the QA panel only shows evaluation metadata.

This keeps the QA Agent honest — it tests the actual user-facing experience, not a hidden backend path.

Prerequisites

Your project needs:

A chat interface with a hook/function that sends messages (e.g., sendMessage(text, sessionId?))
A streaming indicator (e.g., isStreaming: boolean) that signals when the agent is responding
A messages array that updates reactively as messages arrive
An LLM API key (OpenAI or equivalent) for the evaluation judge
(Optional) An approval/confirmation system if your agent has write-action gates

Part 1: Backend — LLM Evaluation Service

1.1 Evaluation Service

The evaluation service is a stateless LLM judge. It receives the conversation so far plus evaluation criteria, and returns a verdict.

# services/qa/agent.py

import json
from langchain_openai import ChatOpenAI
from langchain_core.messages import SystemMessage, HumanMessage

class QAAgentService:
    def __init__(self, api_key: str, model: str = "gpt-5-mini"):
        self._llm = ChatOpenAI(
            model=model,
            temperature=0.1,
            api_key=api_key,  # MUST pass explicitly — don't rely on env vars
        )

    async def evaluate_turn(
        self,
        *,
        conversation: list[dict],   # [{role: "user"|"assistant"|"qa_agent", content: str}]
        query: str,                  # Original user query that started the test
        context: str = "",           # Background info (invisible to the agent)
        pass_criteria: str = "The agent provides a helpful and accurate response.",
        turn_number: int = 1,
        max_turns: int = 3,
    ) -> dict:
        system_prompt = (
            "You are a QA agent evaluating a conversation between a user and an AI agent system.\n\n"
            "You are given:\n"
            "- The original user query\n"
            "- Context about what the user is trying to do (use this to answer follow-ups)\n"
            "- Pass criteria defining what success looks like\n"
            "- The conversation so far\n\n"
            "After reviewing the latest agent response, decide:\n"
            '- "pass": The agent satisfied the pass criteria.\n'
            '- "fail": The agent clearly cannot satisfy the criteria.\n'
            '- "continue": More interaction needed. Generate a follow_up message.\n\n'
            "Respond with JSON:\n"
            '{ "verdict": "pass"|"fail"|"continue", "confidence": 0.0-1.0, '
            '"reasoning": "brief explanation", "follow_up": "message or null" }\n'
        )

        # Gentle nudge past limits, but don't force a verdict
        if turn_number > max_turns:
            system_prompt += (
                f"\nNOTE: Turn {turn_number}, past the expected limit of {max_turns}. "
                "Consider wrapping up, but only pass/fail if criteria warrant it.\n"
            )

        # Format conversation
        conv_text = ""
        for entry in conversation:
            label = "User" if entry["role"] in ("user", "qa_agent") else "Agent"
            conv_text += f"**{label}:** {entry['content'][:2000]}\n\n"

        user_prompt = (
            f"## Original Query\n{query}\n\n"
            f"## Context\n{context or '(none)'}\n\n"
            f"## Pass Criteria\n{pass_criteria}\n\n"
            f"## Conversation (turn {turn_number}/{max_turns})\n{conv_text}"
        )

        resp = await self._llm.ainvoke([
            SystemMessage(content=system_prompt),
            HumanMessage(content=user_prompt),
        ])

        # Parse — handle markdown fences from some models
        text = resp.content.strip()
        if text.startswith("```"):
            text = text.split("\n", 1)[1] if "\n" in text else text[3:]
            text = text.rsplit("```", 1)[0]

        result = json.loads(text.strip())
        return {
            "verdict": result.get("verdict", "fail"),
            "confidence": float(result.get("confidence", 0.5)),
            "reasoning": result.get("reasoning", ""),
            "follow_up": result.get("follow_up"),
        }

Key decisions:

temperature=0.1 — You want consistent, not creative, judgments.
The context field is only for the judge — it's never sent to the actual agent. This lets the QA agent "know" what answers to give when the agent asks clarifying questions.
The follow_up field holds the text that is sent as the next user message when the verdict is "continue", for example when the agent asks a clarifying question.

1.2 API Route

Expose a single stateless JSON endpoint:

# routes/qa.py

from pydantic import BaseModel, Field
from fastapi import APIRouter

router = APIRouter(prefix="/api/qa")

class QAEvalRequest(BaseModel):
    system_id: str
    query: str
    context: str = ""
    pass_criteria: str = "The agent provides a helpful and accurate response."
    conversation: list[dict] = Field(default_factory=list)
    turn_number: int = Field(default=1, ge=1)
    max_turns: int = Field(default=3, ge=1, le=10)

@router.post("/evaluate-turn")
async def evaluate_qa_turn(body: QAEvalRequest):
    from services.qa.agent import QAAgentService

    service = QAAgentService(api_key="your-key")  # Use your config system
    return await service.evaluate_turn(
        conversation=body.conversation,
        query=body.query,
        context=body.context,
        pass_criteria=body.pass_criteria,
        turn_number=body.turn_number,
        max_turns=body.max_turns,
    )

Response shape:

{
  "verdict": "pass" | "fail" | "continue",
  "confidence": 0.85,
  "reasoning": "The agent correctly listed all repositories.",
  "follow_up": null | "Yes, please show me the details for repo X"
}
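
Because the judge's output is LLM-generated, the frontend may want to validate the JSON before acting on it. The following is a minimal sketch of such a check; `parseEvalResponse`, `isVerdict`, and the `EvalResponse` type are hypothetical names introduced here for illustration, not part of the original implementation.

```typescript
type Verdict = 'pass' | 'fail' | 'continue'

interface EvalResponse {
  verdict: Verdict
  confidence: number
  reasoning: string
  follow_up: string | null
}

function isVerdict(value: unknown): value is Verdict {
  return value === 'pass' || value === 'fail' || value === 'continue'
}

// Hypothetical helper: narrows unknown JSON to EvalResponse, or throws.
function parseEvalResponse(raw: unknown): EvalResponse {
  if (typeof raw !== 'object' || raw === null) {
    throw new Error('Evaluation response is not an object')
  }
  const r = raw as Record<string, unknown>
  if (!isVerdict(r.verdict)) {
    throw new Error(`Unexpected verdict: ${String(r.verdict)}`)
  }
  const confidence = typeof r.confidence === 'number' ? r.confidence : NaN
  if (!(confidence >= 0 && confidence <= 1)) {
    throw new Error('confidence must be a number between 0 and 1')
  }
  // Treat empty or non-string follow-ups as "no follow-up".
  const followUp =
    typeof r.follow_up === 'string' && r.follow_up.trim() !== '' ? r.follow_up : null
  if (r.verdict === 'continue' && followUp === null) {
    throw new Error('A "continue" verdict needs a follow_up message')
  }
  return {
    verdict: r.verdict,
    confidence,
    reasoning: typeof r.reasoning === 'string' ? r.reasoning : '',
    follow_up: followUp,
  }
}
```

A thrown error here can be surfaced as the 'error' verdict defined in the frontend types rather than silently ending the run.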

Part 2: Frontend — Orchestration Hook

2.1 Types

export interface QAEvaluation {
  turn: number
  role: 'user' | 'qa_agent'
  verdict: 'pass' | 'fail' | 'continue' | 'error'
  confidence: number
  reasoning: string
  followUp?: string | null
  warning?: string
}

export interface QASummary {
  verdict: string
  confidence: number
  reasoning: string
  totalTurns: number
  elapsedSeconds: number
}

export type ApprovalMode = 'manual' | 'auto-approve' | 'auto-deny'

export interface QAAgentParams {
  query: string
  context: string
  passCriteria: string
  maxTurns: number
  maxTimeSeconds: number
  approvalMode: ApprovalMode
}
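
The frontend types use camelCase while the endpoint in section 1.2 expects snake_case fields, so the hook needs a small mapping step before each POST. A sketch of that mapping, assuming a hypothetical `buildEvalRequest` helper (the `EvalParams`, `ConversationEntry`, and `EvalRequestBody` names are also illustrative):

```typescript
// Mirrors the fields of QAAgentParams that the judge needs.
interface EvalParams {
  query: string
  context: string
  passCriteria: string
  maxTurns: number
}

interface ConversationEntry {
  role: 'user' | 'assistant' | 'qa_agent'
  content: string
}

// Field names follow QAEvalRequest in routes/qa.py.
interface EvalRequestBody {
  system_id: string
  query: string
  context: string
  pass_criteria: string
  conversation: ConversationEntry[]
  turn_number: number
  max_turns: number
}

// Hypothetical helper, not part of the original hook.
function buildEvalRequest(
  systemId: string,
  params: EvalParams,
  conversation: ConversationEntry[],
  turnNumber: number,
): EvalRequestBody {
  return {
    system_id: systemId,
    query: params.query,
    context: params.context,
    pass_criteria: params.passCriteria,
    conversation,
    // The route requires turn_number >= 1.
    turn_number: Math.max(1, Math.floor(turnNumber)),
    // The route rejects max_turns outside 1..10, so clamp before sending.
    max_turns: Math.min(10, Math.max(1, Math.floor(params.maxTurns))),
  }
}
```

Clamping on the client is one option; surfacing a form validation error for out-of-range values would work equally well.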

2.2 The useQAAgent Hook

The hook is the core orchestrator. It plugs into your existing chat system:

interface UseQAAgentOptions {
  systemId: string
  sendMessage: (text: string, sessionId?: string) => void  // Your existing chat send
  isStreaming: boolean                                       // Your existing streaming flag
  messages: YourMessageType[]                                // Your existing messages array
  activeSessionId: string | null                             // Chat session ID
}

interface UseQAAgentReturn {
  evaluations: QAEvaluation[]
  summary: QASummary | null
  isRunning: boolean
  currentPhase: 'idle' | 'waiting' | 'evaluating' | 'done'
  error: string | null
  approvalNeeded: boolean
  runAgent: (params: QAAgentParams) => void
  abort: () => void
  reset: () => void
}

The loop works like this:

runAgent(params)
  → sendMessage(query)          // Message appears in main chat
  → phase = 'waiting'
  → [watch isStreaming: true → false]
  → phase = 'evaluating'
  → POST /api/qa/evaluate-turn  // Ask LLM judge
  → if verdict === 'continue':
      → sendMessage(follow_up)  // Follow-up appears in main chat
      → back to 'waiting'
  → else:
      → phase = 'done'          // Show summary
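
The same loop can be sketched as a plain async function with its side effects injected, which makes the control flow easy to test outside React. This is an illustration under assumptions, not the hook itself: `runQALoop`, `sendAndWait`, and `evaluate` are hypothetical names, with `sendAndWait` standing in for "send the message and wait for streaming to finish" and `evaluate` for the POST to the eval endpoint.

```typescript
type JudgeVerdict = 'pass' | 'fail' | 'continue'

interface JudgeResult {
  verdict: JudgeVerdict
  confidence: number
  reasoning: string
  follow_up: string | null
}

interface LoopEntry {
  role: 'user' | 'assistant' | 'qa_agent'
  content: string
}

interface LoopDeps {
  // Hypothetical: sends a message to the chat and resolves with the agent's reply.
  sendAndWait: (text: string) => Promise<string>
  // Hypothetical: asks the LLM judge for a verdict on the conversation so far.
  evaluate: (conversation: LoopEntry[], turn: number) => Promise<JudgeResult>
}

async function runQALoop(
  query: string,
  maxTurns: number,
  deps: LoopDeps,
): Promise<{ verdict: 'pass' | 'fail'; totalTurns: number }> {
  const conversation: LoopEntry[] = []
  const hardLimit = maxTurns * 2 // hard stop, see "Soft Limits vs Hard Stops"
  let nextMessage = query
  let role: 'user' | 'qa_agent' = 'user'

  for (let turn = 1; turn <= hardLimit; turn++) {
    conversation.push({ role, content: nextMessage })
    const reply = await deps.sendAndWait(nextMessage) // phase: 'waiting'
    conversation.push({ role: 'assistant', content: reply })

    const result = await deps.evaluate(conversation, turn) // phase: 'evaluating'
    if (result.verdict !== 'continue') {
      return { verdict: result.verdict, totalTurns: turn } // phase: 'done'
    }
    if (!result.follow_up) {
      // "continue" without a follow-up cannot make progress.
      return { verdict: 'fail', totalTurns: turn }
    }
    nextMessage = result.follow_up
    role = 'qa_agent'
  }
  return { verdict: 'fail', totalTurns: hardLimit }
}
```

In the real hook the "wait" step is driven by the isStreaming transition rather than a promise, which is why the patterns in section 2.3 matter.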

2.3 Critical Patterns

Stale Closure Prevention

This is the #1 bug source. React hooks capture values at render time, but the QA loop runs across many renders. You MUST use refs for anything read in async callbacks:

// Keep latest values in refs — update on every render
const messagesRef = useRef(messages)
messagesRef.current = messages
const sendMessageRef = useRef(sendMessage)
sendMessageRef.current = sendMessage
const activeSessionIdRef = useRef(activeSessionId)
activeSessionIdRef.current = activeSessionId

Then in your async handlers, always read from messagesRef.current, never from messages.
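
The difference can be reproduced without React. In this sketch a plain `{ current }` object stands in for the object returned by useRef, and `createHandlers` is a hypothetical function playing the role of a callback created during the first render:

```typescript
interface Ref<T> {
  current: T
}

// Simulates handlers created during the first render.
function createHandlers(messagesAtFirstRender: string[], latestMessages: Ref<string[]>) {
  return {
    // Closes over the array from the first render: never sees later messages.
    countStale: (): number => messagesAtFirstRender.length,
    // Reads through the ref at call time: sees whatever the latest render stored.
    countFresh: (): number => latestMessages.current.length,
  }
}

const firstRenderMessages = ['List my repos']
const latestMessagesRef: Ref<string[]> = { current: firstRenderMessages }
const handlers = createHandlers(firstRenderMessages, latestMessagesRef)

// A later render produces a new array and stores it in the ref.
latestMessagesRef.current = [...firstRenderMessages, 'Here are your repositories...']

const staleCount = handlers.countStale() // 1
const freshCount = handlers.countFresh() // 2
```

The stale handler still reports one message because React state updates replace the array instead of mutating it, so the old closure keeps pointing at the old array.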

Streaming Transition Detection

Watch isStreaming to know when the agent finished responding:

const wasStreamingRef = useRef(false)

useEffect(() => {
  const wasStreaming = wasStreamingRef.current
  wasStreamingRef.current = isStreaming

  if (wasStreaming && !isStreaming && phaseRef.current === 'waiting') {
    handleResponseComplete()
  }
}, [isStreaming])

This fires once on the true → false transition — NOT on every re-render.
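
The edge-detection logic is small enough to isolate and test on its own. The sketch below is a framework-free version of the same idea; `createStreamEndDetector` is a hypothetical name, and the closure variable plays the part of wasStreamingRef.

```typescript
// Returns a function to call with each new isStreaming value.
// onStreamEnd runs only when the value goes from true to false.
function createStreamEndDetector(onStreamEnd: () => void): (isStreaming: boolean) => void {
  let wasStreaming = false
  return (isStreaming: boolean): void => {
    const previous = wasStreaming
    wasStreaming = isStreaming
    if (previous && !isStreaming) {
      onStreamEnd()
    }
  }
}

let completedResponses = 0
const observeStreaming = createStreamEndDetector(() => {
  completedResponses += 1
})

// Two responses: each one streams for a while, then stops.
for (const value of [false, true, true, false, false, true, false]) {
  observeStreaming(value)
}
// completedResponses is now 2
```

Repeated false values and repeated true values are ignored, which matches the requirement that the handler runs once per response.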

Phase State Synchronization

Use both React state (for UI rendering) and a ref (for async logic) to track the current phase:

type Phase = 'idle' | 'waiting' | 'evaluating' | 'done'

const [currentPhase, setCurrentPhase] = useState<Phase>('idle')
const phaseRef = useRef<Phase>('idle')

// Always update both
phaseRef.current = 'waiting'
setCurrentPhase('waiting')

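The ref prevents race conditions: setCurrentPhase only takes effect on the next render, so an effect or async callback that runs before then would still read the old phase from state, while phaseRef.current is already up to date.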

Soft Limits vs Hard Stops

Don't hard-stop at maxTurns — it produces poor evaluations. Instead:

turn > maxTurns      → soft warning (amber banner, agent continues)
turn > maxTurns × 2  → hard stop (force fail verdict)
time > maxTime       → soft warning (agent continues)
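
These rules translate directly into a small pure function. The following is a sketch only; `checkLimits` and the `LimitStatus` type are hypothetical names, and the messages are placeholders.

```typescript
type LimitStatus =
  | { kind: 'ok' }
  | { kind: 'warn'; message: string }
  | { kind: 'stop'; message: string }

function checkLimits(
  turn: number,
  maxTurns: number,
  elapsedSeconds: number,
  maxTimeSeconds: number,
): LimitStatus {
  // Hard stop: only once the run is past twice the expected number of turns.
  if (turn > maxTurns * 2) {
    return { kind: 'stop', message: `Stopped at turn ${turn} (hard limit ${maxTurns * 2})` }
  }
  // Soft limits: warn, but let the run continue.
  if (turn > maxTurns) {
    return { kind: 'warn', message: `Turn ${turn} is past the expected limit of ${maxTurns}` }
  }
  if (elapsedSeconds > maxTimeSeconds) {
    return { kind: 'warn', message: `Run has exceeded ${maxTimeSeconds}s` }
  }
  return { kind: 'ok' }
}
```

With maxTurns set to 3, turns 1 to 3 return 'ok', turns 4 to 6 return 'warn', and turn 7 returns 'stop'.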

Part 3: Frontend — QA Panel UI

3.1 Panel Layout

The QA panel has two modes:

Form mode (when idle, no results):

Query textarea — "The message to send to the agent system"
Context textarea — "Background info for follow-ups (not sent to agent)"
Pass Criteria textarea — "What success looks like"
Collapsible settings: Max Turns, Time Limit, Write Approvals mode
"Run QA Test" button

Results mode (when running or has results):

Test config summary (collapsed)
EvalCard per turn — shows verdict badge, reasoning, follow-up sent, warnings
Phase indicator with spinner ("Agent is responding...", "Evaluating response...")
Approval needed banner (pulsing orange, when manual approval mode)
SummaryCard at end — PASSED/FAILED with confidence, total turns, elapsed time

3.2 Integration with Chat Page

Mount the QA panel alongside your existing chat page:

// In your chat page component
const [showQAPanel, setShowQAPanel] = useState(false)

return (
  <div className="flex h-full">
    {/* Main chat — takes remaining space */}
    <div className={showQAPanel ? 'flex-1' : 'w-full'}>
      <YourExistingChat ... />
    </div>

    {/* QA panel — fixed width on the right */}
    {showQAPanel && (
      <div className="w-[420px] flex-shrink-0">
        <QAAgentPanel
          systemId={selectedSystem}
          sendMessage={sendMessage}      // From your chat hook
          isStreaming={isStreaming}        // From your chat hook
          messages={messages}             // From your chat hook
          activeSessionId={sessionId}     // From your chat state
        />
      </div>
    )}
  </div>
)

Add a toggle button in your chat toolbar:

<Button
  variant={showQAPanel ? 'secondary' : 'ghost'}
  onClick={() => setShowQAPanel(v => !v)}
>
  <FlaskConical className="w-4 h-4" />
  QA Agent
</Button>

Part 4: Approval Handling

If your agent system has approval gates (e.g., for write operations), the QA Agent needs to handle them. Without this, the agent will block waiting for human input and the QA run will hang.

Detection

Watch the messages array for approval payloads:

useEffect(() => {
  if (phaseRef.current !== 'waiting' || !paramsRef.current) return

  const pendingMsg = messages.find(m => m.pendingApproval)
  if (!pendingMsg?.pendingApproval) {
    setApprovalNeeded(false)
    return
  }

  const approvalId = pendingMsg.pendingApproval.approvalId
  if (handledApprovalIdsRef.current.has(approvalId)) return

  if (paramsRef.current.approvalMode === 'manual') {
    setApprovalNeeded(true)  // Show banner
    return
  }

  // Auto-resolve
  handledApprovalIdsRef.current.add(approvalId)
  const decision = paramsRef.current.approvalMode === 'auto-approve' ? 'approve' : 'deny'
  
  // Call your approval API. Read the session ID from its ref to avoid a stale value.
  submitApproval(activeSessionIdRef.current, approvalId, pendingMsg.pendingApproval.tools, decision)
}, [messages])

Three modes

| Mode | Behavior |
| --- | --- |
| `manual` | Shows a pulsing orange banner: "Approval required — handle it in the main chat" |
| `auto-approve` | Calls approval API immediately with all tools approved |
| `auto-deny` | Calls approval API immediately with all tools denied |

Deduplication

Track handled approval IDs in a ref (Set<string>) so that each approval is submitted only once, even though the effect re-runs every time the messages array changes.
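
The mode handling and the deduplication can be combined into one pure decision function that the effect calls. This is a sketch under assumptions: `resolveApproval` and `ApprovalAction` are hypothetical names, and the caller is still responsible for invoking its own approval API.

```typescript
type ApprovalMode = 'manual' | 'auto-approve' | 'auto-deny'

type ApprovalAction =
  | { type: 'none' }
  | { type: 'show-banner' }
  | { type: 'submit'; decision: 'approve' | 'deny' }

// Decides what to do with a pending approval. Mutates `handled` when it
// returns a 'submit' action so the same approval is never submitted twice.
function resolveApproval(
  mode: ApprovalMode,
  approvalId: string | null,
  handled: Set<string>,
): ApprovalAction {
  if (approvalId === null || handled.has(approvalId)) {
    return { type: 'none' }
  }
  if (mode === 'manual') {
    // Leave the ID unhandled: the user resolves it in the main chat.
    return { type: 'show-banner' }
  }
  handled.add(approvalId)
  return { type: 'submit', decision: mode === 'auto-approve' ? 'approve' : 'deny' }
}

const handledIds = new Set<string>()
const firstAction = resolveApproval('auto-approve', 'appr-1', handledIds)  // submit, approve
const repeatAction = resolveApproval('auto-approve', 'appr-1', handledIds) // none
```

Keeping the decision separate from the API call makes the three modes easy to unit test without mocking the chat backend.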

Adaptation Checklist

When porting to a new project:

Common Pitfalls

1. Stale Closures (React)

Symptom: Hook reads old messages, sends wrong follow-ups, or silently fails after the first turn.
Fix: Use refs for ALL values read in async callbacks. Update refs on every render (ref.current = value).

2. Missing API Key

Symptom: 500 error from eval endpoint — "api_key client option must be set".
Fix: Pass api_key explicitly in the LLM constructor. Don't rely on OPENAI_API_KEY env var.

3. Approval Deadlock

Symptom: QA run hangs at "Agent is responding..." after the agent triggers a write action.
Fix: Implement approval detection (Part 4). Without it, the backend blocks forever on the approval gate.

4. Invisible Errors

Symptom: After an error, the panel resets to the form view instead of showing the error.
Fix: Include error !== null in your hasResults check so the panel stays in results mode.
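
A sketch of that check, assuming the panel derives its mode from a handful of hook values (`PanelState` and `hasResults` are illustrative names; the field names are assumptions):

```typescript
interface PanelState {
  isRunning: boolean
  evaluationCount: number
  hasSummary: boolean
  error: string | null
}

// Results mode applies while running, once anything has been recorded,
// or when an error needs to stay visible.
function hasResults(state: PanelState): boolean {
  return (
    state.isRunning ||
    state.evaluationCount > 0 ||
    state.hasSummary ||
    state.error !== null
  )
}
```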

5. Hard Turn Limits Produce Bad Evaluations

Symptom: Agent is making progress but gets force-failed at maxTurns.
Fix: Use soft limits with warnings. Only hard-stop at 2× maxTurns.

6. JSX Fragment Issues

Symptom: Compile error about "JSX expressions must have one parent element" in settings section.
Fix: Wrap sibling JSX blocks in a <div> or <> fragment when conditionally rendered.
