Add per-service health check endpoints for status page by atlas-agent-omi[bot] · Pull Request #6032 · BasedHardware/omi

atlas-agent-omi · 2026-03-25T08:31:30Z

What

Adds unauthenticated health check endpoints that probe each critical dependency individually.

Endpoints

Endpoint	Checks	Returns
`GET /v1/health/chat`	Anthropic API (list models)	200 or 503
`GET /v1/health/transcription`	Deepgram API (list projects)	200 or 503
`GET /v1/health/ai`	OpenAI API (list models)	200 or 503
`GET /v1/health/storage`	Firestore (read test)	200 or 503
`GET /v1/health/search`	Typesense (health endpoint)	200 or 503
`GET /v1/health/services`	All above concurrently	200/degraded/503

Why

For external monitoring via Instatus status page (omidotme.instatus.com). Each service gets its own monitor so users can see exactly what's up/down.

Details

No auth required (public endpoints for monitoring)
5s timeout per check
Cache-Control: no-cache headers
Aggregate endpoint runs all checks concurrently with asyncio.gather
Returns ok, degraded (some down), or down (all down)

New endpoints (all unauthenticated, 5s timeout): - GET /v1/health/chat — pings Anthropic API - GET /v1/health/transcription — pings Deepgram API - GET /v1/health/ai — pings OpenAI API - GET /v1/health/storage — pings Firestore - GET /v1/health/search — pings Typesense - GET /v1/health/services — aggregate check (all above) Returns 200 + {status: ok} when healthy, 503 + {status: down, error: ...} when not. Designed for external monitoring (Instatus/BetterStack/etc).

greptile-apps · 2026-03-25T08:34:39Z

Greptile Summary

This PR adds a new backend/routers/health.py module with six unauthenticated health check endpoints (/v1/health/chat, /v1/health/transcription, /v1/health/ai, /v1/health/storage, /v1/health/search, /v1/health/services) intended for use with an external status-page monitor. The individual service checks use httpx.AsyncClient with a 5-second timeout and Cache-Control: no-cache headers; the aggregate endpoint runs all checks via asyncio.gather.

Key issues found:

Blocking Firestore call in async context (P1): _check_firestore uses the synchronous google.cloud.firestore.Client and calls .get() directly inside an async def. This blocks the event loop for the full duration of the network round-trip, undermining the concurrency benefit of asyncio.gather in /v1/health/services and degrading other in-flight requests to the app.
In-function import (P2): from database._client import db is placed inside _check_firestore, violating the project's backend import rules that require all imports to be at module top level.
API key name disclosure (P2): The unauthenticated endpoints return specific error strings like "ANTHROPIC_API_KEY not set", exposing infrastructure configuration details to anyone who queries the status page.
Unnecessary API key on Typesense /health (P2): Typesense's /health endpoint is publicly accessible without authentication; sending the API key there is unnecessary and risks key exposure in logs/traces.

Confidence Score: 3/5

Not safe to merge as-is: the blocking synchronous Firestore call inside an async function will stall the event loop under load, impacting other live requests to the backend.
The overall structure and intent of the PR is sound, but the P1 issue — a synchronous Firestore .get() call inside an async def used within asyncio.gather — is a real reliability bug in production. It will block the FastAPI event loop for up to 5 seconds on every /v1/health/storage or /v1/health/services request, degrading concurrent request handling across the entire backend. The fix is straightforward (wrap in run_in_executor or switch to AsyncClient), but it needs to land before this goes live.
backend/routers/health.py — specifically the _check_firestore function at line 79

Important Files Changed

Filename	Overview
backend/routers/health.py	New health check router with per-service endpoints; contains a blocking synchronous Firestore call inside an async function that will stall the event loop, an in-function import violating backend rules, and unauthenticated endpoints that disclose API key configuration status.
backend/main.py	Straightforward addition of the health router import and `app.include_router(health.router)` — no issues.

Sequence Diagram

sequenceDiagram
    participant Monitor as Status Monitor
    participant API as FastAPI Backend
    participant Anthropic as Anthropic API
    participant Deepgram as Deepgram API
    participant OpenAI as OpenAI API
    participant Firestore as Firestore (sync)
    participant Typesense as Typesense

    Monitor->>API: GET /v1/health/services
    API->>API: asyncio.gather(all checks)
    par Concurrent checks
        API->>Anthropic: GET /v1/models (5s timeout)
        Anthropic-->>API: 200 OK
    and
        API->>Deepgram: GET /v1/projects (5s timeout)
        Deepgram-->>API: 200 OK
    and
        API->>OpenAI: GET /v1/models (5s timeout)
        OpenAI-->>API: 200 OK
    and
        API->>Firestore: doc.get() [BLOCKING - holds event loop]
        Firestore-->>API: doc snapshot
    and
        API->>Typesense: GET /health (5s timeout)
        Typesense-->>API: 200 OK
    end
    API-->>Monitor: 200 {status: ok/degraded} or 503 {status: down}

_{Reviews (1): Last reviewed commit: "Add per-service health check endpoints f..." | Re-trigger Greptile}

greptile-apps · 2026-03-25T08:34:42Z

backend/routers/health.py

+async def _check_firestore() -> dict:
+    """Check Firestore connectivity with a minimal read."""
+    try:
+        from database._client import db
+        # Read a nonexistent doc — fast, just checks connectivity
+        doc = db.collection('_health_check').document('ping').get()
+        return {"status": "ok"}
+    except Exception as e:
+        return {"status": "down", "error": str(e)[:200]}


Blocking Firestore call in async function blocks event loop

_check_firestore is declared async but calls db.collection(...).document(...).get() — which uses the synchronous google.cloud.firestore.Client and performs a blocking network I/O call. Awaiting this inside asyncio.gather in health_services does not give you true concurrency: the synchronous .get() will hold the event loop for its entire duration, stalling all other concurrent async tasks (including the other health checks and any in-flight requests to the FastAPI app).

The fix is to offload the blocking call to a thread pool via asyncio.get_event_loop().run_in_executor:

async def _check_firestore() -> dict: """Check Firestore connectivity with a minimal read.""" try: from database._client import db # move to top of file per import rules loop = asyncio.get_event_loop() await loop.run_in_executor( None, lambda: db.collection('_health_check').document('ping').get() ) return {"status": "ok"} except Exception as e: return {"status": "down", "error": str(e)[:200]}

Alternatively, use the async Firestore client (google.cloud.firestore.AsyncClient).

greptile-apps · 2026-03-25T08:34:43Z

backend/routers/health.py

+async def _check_firestore() -> dict:
+    """Check Firestore connectivity with a minimal read."""
+    try:
+        from database._client import db


In-function import violates backend import rules

from database._client import db is placed inside the function body. The backend import rules require all imports to be at the module's top level. Move this to the top of the file alongside the other imports:

Suggested change

from database._client import db

from database._client import db

(Add this line at the top of health.py with the other imports, then remove it from inside _check_firestore.)

Context Used: Backend Python import rules - no in-function impor... (source)

greptile-apps · 2026-03-25T08:34:44Z

backend/routers/health.py

+async def _check_typesense() -> dict:
+    """Check Typesense connectivity."""
+    try:
+        host = os.getenv('TYPESENSE_HOST', '')
+        port = os.getenv('TYPESENSE_HOST_PORT', '443')
+        api_key = os.getenv('TYPESENSE_API_KEY', '')
+        if not host or not api_key:
+            return {"status": "down", "error": "TYPESENSE config not set"}
+        async with httpx.AsyncClient(timeout=TIMEOUT) as client:
+            r = await client.get(
+                f"https://{host}:{port}/health",
+                headers={"X-TYPESENSE-API-KEY": api_key},
+            )
+            if r.status_code == 200:
+                return {"status": "ok"}
+            else:
+                return {"status": "down", "error": f"HTTP {r.status_code}"}
+    except Exception as e:
+        return {"status": "down", "error": str(e)[:200]}


Typesense /health endpoint is public and doesn't require an API key

The Typesense /health endpoint (documented at https://typesense.org/docs/) does not require authentication — it is intentionally public. Sending the API key in the X-TYPESENSE-API-KEY header on this call is unnecessary (though harmless). More importantly, consider using /health without the key at all to avoid any accidental key exposure in logs or network traces.

Suggested change

async def _check_typesense() -> dict:

"""Check Typesense connectivity."""

try:

host = os.getenv('TYPESENSE_HOST', '')

port = os.getenv('TYPESENSE_HOST_PORT', '443')

api_key = os.getenv('TYPESENSE_API_KEY', '')

if not host or not api_key:

return {"status": "down", "error": "TYPESENSE config not set"}

async with httpx.AsyncClient(timeout=TIMEOUT) as client:

r = await client.get(

f"https://{host}:{port}/health",

headers={"X-TYPESENSE-API-KEY": api_key},

)

if r.status_code == 200:

return {"status": "ok"}

else:

return {"status": "down", "error": f"HTTP {r.status_code}"}

except Exception as e:

return {"status": "down", "error": str(e)[:200]}

async with httpx.AsyncClient(timeout=TIMEOUT) as client:

r = await client.get(

f"https://{host}:{port}/health",

)

greptile-apps · 2026-03-25T08:34:45Z

backend/routers/health.py

+async def _check_anthropic() -> dict:
+    """Check Anthropic API connectivity."""
+    try:
+        api_key = os.getenv('ANTHROPIC_API_KEY', '')
+        if not api_key:
+            return {"status": "down", "error": "ANTHROPIC_API_KEY not set"}
+        async with httpx.AsyncClient(timeout=TIMEOUT) as client:
+            r = await client.get(
+                "https://api.anthropic.com/v1/models",
+                headers={
+                    "x-api-key": api_key,
+                    "anthropic-version": "2023-06-01",
+                },
+            )
+            if r.status_code == 200:
+                return {"status": "ok"}
+            elif r.status_code == 401:
+                return {"status": "down", "error": "invalid API key or out of credits"}
+            else:
+                return {"status": "down", "error": f"HTTP {r.status_code}"}
+    except Exception as e:
+        return {"status": "down", "error": str(e)[:200]}


Unauthenticated endpoints disclose API key configuration and status

These endpoints are intentionally public (for monitoring), but they return whether specific API keys are missing ("ANTHROPIC_API_KEY not set", "DEEPGRAM_API_KEY not set", etc.). An attacker probing the status page could enumerate which third-party integrations are configured or not on this backend.

Consider replacing the "key not set" messages with a generic "service not configured" or simply returning {"status": "down"} without specifying the reason, so the error details are not publicly exposed.

- GET /v1/health/listen: tracks Deepgram WebSocket keepalive failures in a rolling 5-minute window. Returns ok/degraded/down based on failure count thresholds. - Adds DG_KEEPALIVE_FAILURES Prometheus counter + rolling failure tracker to utils/metrics.py - Instruments safe_socket.py to record keepalive failures - Includes listen health in /v1/health/services aggregate - Also includes all Cloud Run service health endpoints (chat, transcription, ai, storage, search) from PR #6032

beastoin · 2026-03-28T08:37:12Z

AI PRs solely without any verification are not welcome. Please ask the human representative to close the loop and verify before submitting. Thank you. — by CTO

github-actions · 2026-03-28T08:37:23Z

Hey @atlas-agent-omi[bot] 👋

Thank you so much for taking the time to contribute to Omi! We truly appreciate you putting in the effort to submit this pull request.

After careful review, we've decided not to merge this particular PR. Please don't take this personally — we genuinely try to merge as many contributions as possible, but sometimes we have to make tough calls based on:

Project standards — Ensuring consistency across the codebase
User needs — Making sure changes align with what our users need
Code best practices — Maintaining code quality and maintainability
Project direction — Keeping aligned with our roadmap and vision

Your contribution is still valuable to us, and we'd love to see you contribute again in the future! If you'd like feedback on how to improve this PR or want to discuss alternative approaches, please don't hesitate to reach out.

Thank you for being part of the Omi community! 💜

greptile-apps bot reviewed Mar 25, 2026

View reviewed changes

atlas-agent-omi bot mentioned this pull request Mar 25, 2026

Add service health endpoints + realtime transcription monitoring #6033

Open

beastoin closed this Mar 28, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add per-service health check endpoints for status page#6032

Add per-service health check endpoints for status page#6032
atlas-agent-omi[bot] wants to merge 1 commit intomainfrom
atlas/service-health-endpoints

atlas-agent-omi bot commented Mar 25, 2026

Uh oh!

greptile-apps bot commented Mar 25, 2026

Uh oh!

greptile-apps bot Mar 25, 2026

Uh oh!

greptile-apps bot Mar 25, 2026

Uh oh!

greptile-apps bot Mar 25, 2026

Uh oh!

greptile-apps bot Mar 25, 2026

Uh oh!

beastoin commented Mar 28, 2026

Uh oh!

github-actions bot commented Mar 28, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

	from database._client import db
	from database._client import db

Conversation

atlas-agent-omi bot commented Mar 25, 2026

What

Endpoints

Why

Details

Uh oh!

greptile-apps bot commented Mar 25, 2026

Greptile Summary

Confidence Score: 3/5

Important Files Changed

Sequence Diagram

Uh oh!

greptile-apps bot Mar 25, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps bot Mar 25, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps bot Mar 25, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps bot Mar 25, 2026

Choose a reason for hiding this comment

Uh oh!

beastoin commented Mar 28, 2026

Uh oh!

github-actions bot commented Mar 28, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant