Add per-service health check endpoints for status page #6032
atlas-agent-omi[bot] wants to merge 1 commit into main
New endpoints (all unauthenticated, 5s timeout):
- GET /v1/health/chat — pings Anthropic API
- GET /v1/health/transcription — pings Deepgram API
- GET /v1/health/ai — pings OpenAI API
- GET /v1/health/storage — pings Firestore
- GET /v1/health/search — pings Typesense
- GET /v1/health/services — aggregate check (all above)
Returns 200 + {status: ok} when healthy, 503 + {status: down, error: ...} when not.
Designed for external monitoring (Instatus/BetterStack/etc).
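The response contract above (200 when healthy, 503 otherwise) can be sketched without any web framework. This is only an illustration: `to_http_response` and `NO_CACHE_HEADERS` are hypothetical names, not taken from the PR, and the real endpoints presumably wrap something similar in a FastAPI response.

```python
from typing import Any

# Hypothetical constant: the PR description mentions Cache-Control: no-cache headers.
NO_CACHE_HEADERS = {"Cache-Control": "no-cache"}


def to_http_response(result: dict[str, Any]) -> tuple[int, dict[str, Any], dict[str, str]]:
    """Map a single check result to (HTTP status code, JSON body, headers)."""
    # Anything other than an explicit "ok" is reported as unavailable.
    status_code = 200 if result.get("status") == "ok" else 503
    return status_code, result, dict(NO_CACHE_HEADERS)


if __name__ == "__main__":
    print(to_http_response({"status": "ok"}))
    print(to_http_response({"status": "down", "error": "HTTP 500"}))
```

Keeping the mapping in one small function means each per-service endpoint only has to run its check and pass the result through.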
Greptile Summary
This PR adds a new
Key issues found:
Confidence Score: 3/5
Important Files Changed
Sequence Diagram
sequenceDiagram
participant Monitor as Status Monitor
participant API as FastAPI Backend
participant Anthropic as Anthropic API
participant Deepgram as Deepgram API
participant OpenAI as OpenAI API
participant Firestore as Firestore (sync)
participant Typesense as Typesense
Monitor->>API: GET /v1/health/services
API->>API: asyncio.gather(all checks)
par Concurrent checks
API->>Anthropic: GET /v1/models (5s timeout)
Anthropic-->>API: 200 OK
and
API->>Deepgram: GET /v1/projects (5s timeout)
Deepgram-->>API: 200 OK
and
API->>OpenAI: GET /v1/models (5s timeout)
OpenAI-->>API: 200 OK
and
API->>Firestore: doc.get() [BLOCKING - holds event loop]
Firestore-->>API: doc snapshot
and
API->>Typesense: GET /health (5s timeout)
Typesense-->>API: 200 OK
end
API-->>Monitor: 200 {status: ok/degraded} or 503 {status: down}
Reviews (1): Last reviewed commit: "Add per-service health check endpoints f..."
async def _check_firestore() -> dict:
    """Check Firestore connectivity with a minimal read."""
    try:
        from database._client import db
        # Read a nonexistent doc - fast, just checks connectivity
        doc = db.collection('_health_check').document('ping').get()
        return {"status": "ok"}
    except Exception as e:
        return {"status": "down", "error": str(e)[:200]}
Blocking Firestore call in async function blocks event loop
_check_firestore is declared async but calls db.collection(...).document(...).get() — which uses the synchronous google.cloud.firestore.Client and performs a blocking network I/O call. Awaiting this inside asyncio.gather in health_services does not give you true concurrency: the synchronous .get() will hold the event loop for its entire duration, stalling all other concurrent async tasks (including the other health checks and any in-flight requests to the FastAPI app).
The fix is to offload the blocking call to a thread pool via the running loop's run_in_executor (asyncio.get_running_loop() is preferred over asyncio.get_event_loop() inside a coroutine):
async def _check_firestore() -> dict:
    """Check Firestore connectivity with a minimal read."""
    try:
        from database._client import db  # move to top of file per import rules
        loop = asyncio.get_running_loop()
        # Run the synchronous .get() in the default thread pool so the event loop stays free
        await loop.run_in_executor(
            None,
            lambda: db.collection('_health_check').document('ping').get()
        )
        return {"status": "ok"}
    except Exception as e:
        return {"status": "down", "error": str(e)[:200]}
Alternatively, use the async Firestore client (google.cloud.firestore.AsyncClient).
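Offloading the call keeps the event loop free, but by itself it does not apply the 5s limit the other checks use. A minimal sketch of one way to add that bound follows, assuming Python 3.9+ for `asyncio.to_thread`. `check_blocking` is a hypothetical helper, and `call` stands in for the synchronous Firestore `.get()`.

```python
import asyncio
import time
from typing import Any, Callable

TIMEOUT_SECONDS = 5.0  # assumed to match the 5s timeout of the HTTP checks


async def check_blocking(call: Callable[[], Any], timeout: float = TIMEOUT_SECONDS) -> dict:
    """Run a blocking callable in a worker thread and bound how long we wait for it."""
    try:
        await asyncio.wait_for(asyncio.to_thread(call), timeout=timeout)
        return {"status": "ok"}
    except asyncio.TimeoutError:
        # We stop waiting here; the worker thread itself keeps running until the call returns.
        return {"status": "down", "error": "timeout"}
    except Exception as e:
        return {"status": "down", "error": str(e)[:200]}


if __name__ == "__main__":
    # time.sleep is a stand-in for a slow synchronous network read.
    print(asyncio.run(check_blocking(lambda: time.sleep(0.2), timeout=0.05)))
    print(asyncio.run(check_blocking(lambda: None)))
```

The timeout only limits how long the health endpoint waits. A thread cannot be cancelled from outside, so a hung call still occupies a pool worker until it finishes.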
async def _check_firestore() -> dict:
    """Check Firestore connectivity with a minimal read."""
    try:
        from database._client import db
In-function import violates backend import rules
from database._client import db is placed inside the function body. The backend import rules require all imports to be at the module's top level. Move this to the top of the file alongside the other imports:
from database._client import db
(Add this line at the top of health.py with the other imports, then remove it from inside _check_firestore.)
Context Used: Backend Python import rules - no in-function impor...
async def _check_typesense() -> dict:
    """Check Typesense connectivity."""
    try:
        host = os.getenv('TYPESENSE_HOST', '')
        port = os.getenv('TYPESENSE_HOST_PORT', '443')
        api_key = os.getenv('TYPESENSE_API_KEY', '')
        if not host or not api_key:
            return {"status": "down", "error": "TYPESENSE config not set"}
        async with httpx.AsyncClient(timeout=TIMEOUT) as client:
            r = await client.get(
                f"https://{host}:{port}/health",
                headers={"X-TYPESENSE-API-KEY": api_key},
            )
        if r.status_code == 200:
            return {"status": "ok"}
        else:
            return {"status": "down", "error": f"HTTP {r.status_code}"}
    except Exception as e:
        return {"status": "down", "error": str(e)[:200]}
Typesense /health endpoint is public and doesn't require an API key
The Typesense /health endpoint (documented at https://typesense.org/docs/) does not require authentication; it is intentionally public. Sending the API key in the X-TYPESENSE-API-KEY header on this call is therefore unnecessary. It does not break the check, but omitting it avoids any accidental key exposure in logs or network traces.
Suggested change:
        async with httpx.AsyncClient(timeout=TIMEOUT) as client:
            r = await client.get(
                f"https://{host}:{port}/health",
            )
async def _check_anthropic() -> dict:
    """Check Anthropic API connectivity."""
    try:
        api_key = os.getenv('ANTHROPIC_API_KEY', '')
        if not api_key:
            return {"status": "down", "error": "ANTHROPIC_API_KEY not set"}
        async with httpx.AsyncClient(timeout=TIMEOUT) as client:
            r = await client.get(
                "https://api.anthropic.com/v1/models",
                headers={
                    "x-api-key": api_key,
                    "anthropic-version": "2023-06-01",
                },
            )
        if r.status_code == 200:
            return {"status": "ok"}
        elif r.status_code == 401:
            return {"status": "down", "error": "invalid API key or out of credits"}
        else:
            return {"status": "down", "error": f"HTTP {r.status_code}"}
    except Exception as e:
        return {"status": "down", "error": str(e)[:200]}
Unauthenticated endpoints disclose API key configuration and status
These endpoints are intentionally public (for monitoring), but they return whether specific API keys are missing ("ANTHROPIC_API_KEY not set", "DEEPGRAM_API_KEY not set", etc.). An attacker probing the status page could enumerate which third-party integrations are configured or not on this backend.
Consider replacing the "key not set" messages with a generic "service not configured" or simply returning {"status": "down"} without specifying the reason, so the error details are not publicly exposed.
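One way to act on this suggestion is to keep the specific reason in server-side logs and return only the status publicly. The sketch below assumes a hypothetical `public_result` helper and logger name, neither of which appears in the PR.

```python
import logging

logger = logging.getLogger("health")  # hypothetical logger name


def public_result(service: str, result: dict) -> dict:
    """Return the public payload for a check, dropping internal error detail."""
    status = result.get("status", "down")
    if status != "ok":
        # The detailed reason (missing key, HTTP code, exception text) stays in the logs.
        logger.warning("health check failed: service=%s error=%s", service, result.get("error"))
    return {"status": status}


if __name__ == "__main__":
    print(public_result("chat", {"status": "down", "error": "ANTHROPIC_API_KEY not set"}))
    print(public_result("chat", {"status": "ok"}))
```

Operators can still diagnose a failure from the logs, while an external caller learns only that the service is down.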
- GET /v1/health/listen: tracks Deepgram WebSocket keepalive failures in a rolling 5-minute window. Returns ok/degraded/down based on failure count thresholds.
- Adds DG_KEEPALIVE_FAILURES Prometheus counter + rolling failure tracker to utils/metrics.py
- Instruments safe_socket.py to record keepalive failures
- Includes listen health in /v1/health/services aggregate
- Also includes all Cloud Run service health endpoints (chat, transcription, ai, storage, search) from PR #6032
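A rolling failure tracker of the kind described for /v1/health/listen could be built on a deque of timestamps. This is a sketch, not the code in utils/metrics.py: the class name and both thresholds are hypothetical, and only the 5-minute window and the ok/degraded/down outcomes come from the description above.

```python
import time
from collections import deque
from typing import Callable

WINDOW_SECONDS = 300.0   # the rolling 5-minute window
DEGRADED_THRESHOLD = 3   # hypothetical: failures in the window before "degraded"
DOWN_THRESHOLD = 10      # hypothetical: failures in the window before "down"


class RollingFailureTracker:
    """Counts failures that occurred within the last `window` seconds."""

    def __init__(self, window: float = WINDOW_SECONDS,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._window = window
        self._clock = clock  # injectable so the window can be tested without sleeping
        self._events: deque[float] = deque()

    def record_failure(self) -> None:
        self._events.append(self._clock())

    def count(self) -> int:
        # Drop timestamps that have aged out of the window, oldest first.
        cutoff = self._clock() - self._window
        while self._events and self._events[0] <= cutoff:
            self._events.popleft()
        return len(self._events)

    def status(self) -> str:
        failures = self.count()
        if failures >= DOWN_THRESHOLD:
            return "down"
        if failures >= DEGRADED_THRESHOLD:
            return "degraded"
        return "ok"


if __name__ == "__main__":
    tracker = RollingFailureTracker()
    for _ in range(4):
        tracker.record_failure()
    print(tracker.count(), tracker.status())
```

This version is not thread-safe. If failures are recorded from several threads, the deque operations would need to be guarded with a lock.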
AI-only PRs submitted without any verification are not welcome. Please ask the human representative to close the loop and verify before submitting. Thank you. - by CTO
Hey @atlas-agent-omi[bot] 👋 Thank you so much for taking the time to contribute to Omi! We truly appreciate you putting in the effort to submit this pull request. After careful review, we've decided not to merge this particular PR. Please don't take this personally — we genuinely try to merge as many contributions as possible, but sometimes we have to make tough calls based on:
Your contribution is still valuable to us, and we'd love to see you contribute again in the future! If you'd like feedback on how to improve this PR or want to discuss alternative approaches, please don't hesitate to reach out. Thank you for being part of the Omi community! 💜
What
Adds unauthenticated health check endpoints that probe each critical dependency individually.
Endpoints
- GET /v1/health/chat
- GET /v1/health/transcription
- GET /v1/health/ai
- GET /v1/health/storage
- GET /v1/health/search
- GET /v1/health/services
Why
For external monitoring via the Instatus status page (omidotme.instatus.com). Each service gets its own monitor so users can see exactly what's up/down.
Details
- Responses carry Cache-Control: no-cache headers
- The aggregate endpoint runs all checks concurrently via asyncio.gather
- The aggregate returns ok, degraded (some down), or down (all down)