Skip to content

Add per-service health check endpoints for status page#6032

Closed
atlas-agent-omi[bot] wants to merge 1 commit intomainfrom
atlas/service-health-endpoints
Closed

Add per-service health check endpoints for status page#6032
atlas-agent-omi[bot] wants to merge 1 commit intomainfrom
atlas/service-health-endpoints

Conversation

@atlas-agent-omi
Copy link
Copy Markdown

What

Adds unauthenticated health check endpoints that probe each critical dependency individually.

Endpoints

Endpoint Checks Returns
GET /v1/health/chat Anthropic API (list models) 200 or 503
GET /v1/health/transcription Deepgram API (list projects) 200 or 503
GET /v1/health/ai OpenAI API (list models) 200 or 503
GET /v1/health/storage Firestore (read test) 200 or 503
GET /v1/health/search Typesense (health endpoint) 200 or 503
GET /v1/health/services All above concurrently 200/degraded/503

Why

For external monitoring via Instatus status page (omidotme.instatus.com). Each service gets its own monitor so users can see exactly what's up/down.

Details

  • No auth required (public endpoints for monitoring)
  • 5s timeout per check
  • Cache-Control: no-cache headers
  • Aggregate endpoint runs all checks concurrently with asyncio.gather
  • Returns ok, degraded (some down), or down (all down)

New endpoints (all unauthenticated, 5s timeout):
- GET /v1/health/chat — pings Anthropic API
- GET /v1/health/transcription — pings Deepgram API
- GET /v1/health/ai — pings OpenAI API
- GET /v1/health/storage — pings Firestore
- GET /v1/health/search — pings Typesense
- GET /v1/health/services — aggregate check (all above)

Returns 200 + {status: ok} when healthy, 503 + {status: down, error: ...} when not.
Designed for external monitoring (Instatus/BetterStack/etc).
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps bot commented Mar 25, 2026

Greptile Summary

This PR adds a new backend/routers/health.py module with six unauthenticated health check endpoints (/v1/health/chat, /v1/health/transcription, /v1/health/ai, /v1/health/storage, /v1/health/search, /v1/health/services) intended for use with an external status-page monitor. The individual service checks use httpx.AsyncClient with a 5-second timeout and Cache-Control: no-cache headers; the aggregate endpoint runs all checks via asyncio.gather.

Key issues found:

  • Blocking Firestore call in async context (P1): _check_firestore uses the synchronous google.cloud.firestore.Client and calls .get() directly inside an async def. This blocks the event loop for the full duration of the network round-trip, undermining the concurrency benefit of asyncio.gather in /v1/health/services and degrading other in-flight requests to the app.
  • In-function import (P2): from database._client import db is placed inside _check_firestore, violating the project's backend import rules that require all imports to be at module top level.
  • API key name disclosure (P2): The unauthenticated endpoints return specific error strings like "ANTHROPIC_API_KEY not set", exposing infrastructure configuration details to anyone who queries the status page.
  • Unnecessary API key on Typesense /health (P2): Typesense's /health endpoint is publicly accessible without authentication; sending the API key there is unnecessary and risks key exposure in logs/traces.

Confidence Score: 3/5

  • Not safe to merge as-is: the blocking synchronous Firestore call inside an async function will stall the event loop under load, impacting other live requests to the backend.
  • The overall structure and intent of the PR is sound, but the P1 issue — a synchronous Firestore .get() call inside an async def used within asyncio.gather — is a real reliability bug in production. It will block the FastAPI event loop for up to 5 seconds on every /v1/health/storage or /v1/health/services request, degrading concurrent request handling across the entire backend. The fix is straightforward (wrap in run_in_executor or switch to AsyncClient), but it needs to land before this goes live.
  • backend/routers/health.py — specifically the _check_firestore function at line 79

Important Files Changed

Filename Overview
backend/routers/health.py New health check router with per-service endpoints; contains a blocking synchronous Firestore call inside an async function that will stall the event loop, an in-function import violating backend rules, and unauthenticated endpoints that disclose API key configuration status.
backend/main.py Straightforward addition of the health router import and app.include_router(health.router) — no issues.

Sequence Diagram

sequenceDiagram
    participant Monitor as Status Monitor
    participant API as FastAPI Backend
    participant Anthropic as Anthropic API
    participant Deepgram as Deepgram API
    participant OpenAI as OpenAI API
    participant Firestore as Firestore (sync)
    participant Typesense as Typesense

    Monitor->>API: GET /v1/health/services
    API->>API: asyncio.gather(all checks)
    par Concurrent checks
        API->>Anthropic: GET /v1/models (5s timeout)
        Anthropic-->>API: 200 OK
    and
        API->>Deepgram: GET /v1/projects (5s timeout)
        Deepgram-->>API: 200 OK
    and
        API->>OpenAI: GET /v1/models (5s timeout)
        OpenAI-->>API: 200 OK
    and
        API->>Firestore: doc.get() [BLOCKING - holds event loop]
        Firestore-->>API: doc snapshot
    and
        API->>Typesense: GET /health (5s timeout)
        Typesense-->>API: 200 OK
    end
    API-->>Monitor: 200 {status: ok/degraded} or 503 {status: down}
Loading

Reviews (1): Last reviewed commit: "Add per-service health check endpoints f..." | Re-trigger Greptile

Comment on lines +79 to +87
async def _check_firestore() -> dict:
"""Check Firestore connectivity with a minimal read."""
try:
from database._client import db
# Read a nonexistent doc — fast, just checks connectivity
doc = db.collection('_health_check').document('ping').get()
return {"status": "ok"}
except Exception as e:
return {"status": "down", "error": str(e)[:200]}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Blocking Firestore call in async function blocks event loop

_check_firestore is declared async but calls db.collection(...).document(...).get() — which uses the synchronous google.cloud.firestore.Client and performs a blocking network I/O call. Awaiting this inside asyncio.gather in health_services does not give you true concurrency: the synchronous .get() will hold the event loop for its entire duration, stalling all other concurrent async tasks (including the other health checks and any in-flight requests to the FastAPI app).

The fix is to offload the blocking call to a thread pool via asyncio.get_event_loop().run_in_executor:

async def _check_firestore() -> dict:
    """Check Firestore connectivity with a minimal read."""
    try:
        from database._client import db  # move to top of file per import rules
        loop = asyncio.get_event_loop()
        await loop.run_in_executor(
            None,
            lambda: db.collection('_health_check').document('ping').get()
        )
        return {"status": "ok"}
    except Exception as e:
        return {"status": "down", "error": str(e)[:200]}

Alternatively, use the async Firestore client (google.cloud.firestore.AsyncClient).

async def _check_firestore() -> dict:
"""Check Firestore connectivity with a minimal read."""
try:
from database._client import db
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 In-function import violates backend import rules

from database._client import db is placed inside the function body. The backend import rules require all imports to be at the module's top level. Move this to the top of the file alongside the other imports:

Suggested change
from database._client import db
from database._client import db

(Add this line at the top of health.py with the other imports, then remove it from inside _check_firestore.)

Context Used: Backend Python import rules - no in-function impor... (source)

Comment on lines +90 to +108
async def _check_typesense() -> dict:
"""Check Typesense connectivity."""
try:
host = os.getenv('TYPESENSE_HOST', '')
port = os.getenv('TYPESENSE_HOST_PORT', '443')
api_key = os.getenv('TYPESENSE_API_KEY', '')
if not host or not api_key:
return {"status": "down", "error": "TYPESENSE config not set"}
async with httpx.AsyncClient(timeout=TIMEOUT) as client:
r = await client.get(
f"https://{host}:{port}/health",
headers={"X-TYPESENSE-API-KEY": api_key},
)
if r.status_code == 200:
return {"status": "ok"}
else:
return {"status": "down", "error": f"HTTP {r.status_code}"}
except Exception as e:
return {"status": "down", "error": str(e)[:200]}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Typesense /health endpoint is public and doesn't require an API key

The Typesense /health endpoint (documented at https://typesense.org/docs/) does not require authentication — it is intentionally public. Sending the API key in the X-TYPESENSE-API-KEY header on this call is unnecessary (though harmless). More importantly, consider using /health without the key at all to avoid any accidental key exposure in logs or network traces.

Suggested change
async def _check_typesense() -> dict:
"""Check Typesense connectivity."""
try:
host = os.getenv('TYPESENSE_HOST', '')
port = os.getenv('TYPESENSE_HOST_PORT', '443')
api_key = os.getenv('TYPESENSE_API_KEY', '')
if not host or not api_key:
return {"status": "down", "error": "TYPESENSE config not set"}
async with httpx.AsyncClient(timeout=TIMEOUT) as client:
r = await client.get(
f"https://{host}:{port}/health",
headers={"X-TYPESENSE-API-KEY": api_key},
)
if r.status_code == 200:
return {"status": "ok"}
else:
return {"status": "down", "error": f"HTTP {r.status_code}"}
except Exception as e:
return {"status": "down", "error": str(e)[:200]}
async with httpx.AsyncClient(timeout=TIMEOUT) as client:
r = await client.get(
f"https://{host}:{port}/health",
)

Comment on lines +17 to +38
async def _check_anthropic() -> dict:
"""Check Anthropic API connectivity."""
try:
api_key = os.getenv('ANTHROPIC_API_KEY', '')
if not api_key:
return {"status": "down", "error": "ANTHROPIC_API_KEY not set"}
async with httpx.AsyncClient(timeout=TIMEOUT) as client:
r = await client.get(
"https://api.anthropic.com/v1/models",
headers={
"x-api-key": api_key,
"anthropic-version": "2023-06-01",
},
)
if r.status_code == 200:
return {"status": "ok"}
elif r.status_code == 401:
return {"status": "down", "error": "invalid API key or out of credits"}
else:
return {"status": "down", "error": f"HTTP {r.status_code}"}
except Exception as e:
return {"status": "down", "error": str(e)[:200]}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Unauthenticated endpoints disclose API key configuration and status

These endpoints are intentionally public (for monitoring), but they return whether specific API keys are missing ("ANTHROPIC_API_KEY not set", "DEEPGRAM_API_KEY not set", etc.). An attacker probing the status page could enumerate which third-party integrations are configured or not on this backend.

Consider replacing the "key not set" messages with a generic "service not configured" or simply returning {"status": "down"} without specifying the reason, so the error details are not publicly exposed.

atlas-agent-omi bot pushed a commit that referenced this pull request Mar 25, 2026
- GET /v1/health/listen: tracks Deepgram WebSocket keepalive failures
  in a rolling 5-minute window. Returns ok/degraded/down based on
  failure count thresholds.
- Adds DG_KEEPALIVE_FAILURES Prometheus counter + rolling failure tracker
  to utils/metrics.py
- Instruments safe_socket.py to record keepalive failures
- Includes listen health in /v1/health/services aggregate
- Also includes all Cloud Run service health endpoints (chat, transcription,
  ai, storage, search) from PR #6032
@beastoin
Copy link
Copy Markdown
Collaborator

AI PRs solely without any verification are not welcome. Please ask the human representative to close the loop and verify before submitting. Thank you. — by CTO

@beastoin beastoin closed this Mar 28, 2026
@github-actions
Copy link
Copy Markdown
Contributor

Hey @atlas-agent-omi[bot] 👋

Thank you so much for taking the time to contribute to Omi! We truly appreciate you putting in the effort to submit this pull request.

After careful review, we've decided not to merge this particular PR. Please don't take this personally — we genuinely try to merge as many contributions as possible, but sometimes we have to make tough calls based on:

  • Project standards — Ensuring consistency across the codebase
  • User needs — Making sure changes align with what our users need
  • Code best practices — Maintaining code quality and maintainability
  • Project direction — Keeping aligned with our roadmap and vision

Your contribution is still valuable to us, and we'd love to see you contribute again in the future! If you'd like feedback on how to improve this PR or want to discuss alternative approaches, please don't hesitate to reach out.

Thank you for being part of the Omi community! 💜

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant