Add per-check diagnostics to DeepHealthCheck API #9350
Open
Extend the DeepHealthCheck API to return detailed per-check diagnostic information alongside the existing HealthState enum. This enables upstream consumers (e.g. saas-control-plane fault detection) to understand which specific health check failed and why, rather than only knowing a cell is unhealthy.

Changes:
- New proto: health/v1/message.proto with HealthCheck, HostHealthDetail, and ServiceHealthDetail messages
- New enum value: HEALTH_STATE_INTERNAL_ERROR for resolver failures
- History handler runs all 5 checks (gRPC health, RPC latency/errors, persistence latency/errors) and returns per-check results with check_type, value, threshold, and human-readable message
- Frontend aggregator collects per-host details and propagates them through ServiceHealthDetail
- AdminService passes through ServiceHealthDetail to callers
- check_type uses string constants (not proto enum) for extensibility; constants live in common/health/ for import by consumers

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Previously, a gRPC health check error caused an immediate return, skipping all 4 remaining checks. Now the error is captured as a NOT_SERVING check result with the error message, and all checks continue to run so the full diagnostic picture is always returned.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When the gRPC health check returns an error, return early with only that check result rather than running the remaining metric checks. The caller still gets the diagnostic info about what failed (check_type + message) instead of a bare error.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
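A minimal sketch of the behavior this commit describes, assuming hypothetical helper and constant names (`checkGRPCHealth`, `CheckTypeGRPCHealth`); it is not the actual handler code:

```go
// Sketch: if pinging the gRPC health server itself fails, return only that
// one check result so the caller still gets check_type + message instead of
// a bare error, and skip the remaining metric checks.
if err := checkGRPCHealth(ctx); err != nil { // assumed helper around the gRPC health client
	return &historyservice.DeepHealthCheckResponse{
		State: enumsspb.HEALTH_STATE_NOT_SERVING,
		Checks: []*healthspb.HealthCheck{{
			CheckType: health.CheckTypeGRPCHealth, // assumed constant for "grpc_health"
			State:     enumsspb.HEALTH_STATE_NOT_SERVING,
			Message:   fmt.Sprintf("gRPC health check failed: %v", err),
		}},
	}, nil
}
// Otherwise record a SERVING grpc_health result and continue with the
// RPC/persistence latency and error-ratio checks.
```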
- Rename healthpb -> healthspb per importas config
- Add default clause to switch statement (revive)
- Use s.Require().NoError() for error assertions (testifylint)
- Use s.InDelta() for float comparisons (testifylint)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Comment on lines 91 to +107
```go
if err != nil {
    h.logger.Warn("failed to ping deep health check", tag.Error(err), tag.ServerName(string(h.serviceName)))
    // Synthetic check: the host health check RPC failed, so we create a
    // HealthCheck entry to propagate the error message upstream. The State
    // here mirrors DeepHealthCheckResponse.State since there's only one check.
    resp = &historyservice.DeepHealthCheckResponse{
        State: enumsspb.HEALTH_STATE_NOT_SERVING,
        Checks: []*healthspb.HealthCheck{
            {
                CheckType: health.CheckTypeHostAvailability,
                State:     enumsspb.HEALTH_STATE_NOT_SERVING,
                Message:   fmt.Sprintf("failed to reach host for health check: %v", err),
            },
        },
    }
}
if resp == nil {
```
Might be a bug here. Is it guaranteed that `healthcheckfn` can't return `err != nil` and `resp == nil`? In that case, we will blank out `err` by overriding `resp`.
```go
healthState := <-receiveCh
switch healthState {
case enumsspb.HEALTH_STATE_NOT_SERVING, enumsspb.HEALTH_STATE_UNSPECIFIED:
	result := <-receiveCh
```
Minor: panic vulnerability here; if one of the goroutines we're waiting on panics, this will hang forever.
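A minimal sketch of one way to guard against that, assuming a per-host worker that reports into `receiveCh` (the helper `checkHost` and the channel's element type are illustrative):

```go
// Sketch: make sure the worker always sends exactly one value, even if the
// per-host check panics, so the collecting loop can never block forever.
go func() {
	defer func() {
		if r := recover(); r != nil {
			receiveCh <- enumsspb.HEALTH_STATE_NOT_SERVING // report the host as failed
		}
	}()
	receiveCh <- checkHost(ctx, hostAddress) // hypothetical per-host check
}()
```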
```go
} else {
    failedHostCountProportion := failedHostCount / float64(len(hosts))
    if failedHostCountProportion+hostDeclinedServingProportion > h.hostFailurePercentage() {
        h.logger.Warn("health check exceeded host failure percentage threshold", tag.Float64("host failure percentage threshold", h.hostFailurePercentage()), tag.Float64("host failure percentage", failedHostCountProportion), tag.Float64("host declined serving percentage", hostDeclinedServingProportion))
```
It would be nice to add an example failure from the failed hosts in this log (we can get it from `ensureMinimumProportionOfHosts`).
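For example, the warning could carry one sample failure message (the `exampleFailure` variable is hypothetical, e.g. surfaced from `ensureMinimumProportionOfHosts`):

```go
// Sketch: attach one concrete failure from the failed hosts so the log
// explains why hosts were counted as failed, not just how many.
h.logger.Warn("health check exceeded host failure percentage threshold",
	tag.Float64("host failure percentage threshold", h.hostFailurePercentage()),
	tag.Float64("host failure percentage", failedHostCountProportion),
	tag.Float64("host declined serving percentage", hostDeclinedServingProportion),
	tag.NewStringTag("example-failure", exampleFailure),
)
```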
Summary
Extends the `DeepHealthCheck` API to return per-check diagnostic details alongside the existing `HealthState` enum. When fault detection triggers a cell failover, operators can now see exactly which health check failed and why — not just that the cell is unhealthy.

What changed
- New proto `health/v1` — `HealthCheck` message with `check_type` (string), `state`, `value`, `threshold`, and human-readable `message`. `HostHealthDetail` and `ServiceHealthDetail` aggregate per-host and per-service results.
- New enum value `HEALTH_STATE_INTERNAL_ERROR` — for infrastructure failures like membership resolver errors (previously returned `UNSPECIFIED`).
- Frontend aggregator — collects per-host `HostHealthDetail` (address, state, checks) and builds a `ServiceHealthDetail` with diagnostic messages for all paths — including resolver errors and empty membership.
- AdminService — passes `ServiceHealthDetail` through to callers.
- `check_type` uses string constants (`common/health/check_types.go`) instead of a proto enum for extensibility — new check types can be added without proto changes, and the `message` field provides human-readable context with actual values (e.g. "RPC latency 850.00ms exceeded 500.00ms threshold"). See the sketch below.
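As a rough sketch of the constants idea (only `CheckTypeHostAvailability` and the string values appear elsewhere in this PR; the other constant names are assumptions):

```go
// common/health: check_type string constants that consumers can import
// without depending on a proto enum (sketch, not the actual file).
package health

const (
	CheckTypeGRPCHealth            = "grpc_health"
	CheckTypeRPCLatency            = "rpc_latency"
	CheckTypeRPCErrorRatio         = "rpc_error_ratio"
	CheckTypePersistenceLatency    = "persistence_latency"
	CheckTypePersistenceErrorRatio = "persistence_error_ratio"
	CheckTypeHostAvailability      = "host_availability"
)
```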
How it works

The call chain is: `AdminService.DeepHealthCheck()` → `HealthChecker.Check()` → fan-out to all history hosts in membership → `HistoryHandler.DeepHealthCheck()` per host.

Each history host runs 5 independent checks and returns all results:

- `grpc_health` — is the gRPC health server serving?
- `rpc_latency` — average RPC latency vs threshold
- `rpc_error_ratio` — RPC error rate vs threshold
- `persistence_latency` — DB latency vs threshold
- `persistence_error_ratio` — DB error rate vs threshold

The frontend collects results from all hosts in membership, aggregates them, and returns the full breakdown.
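As an illustration of how an upstream consumer might walk the breakdown (getter names assume standard protobuf Go codegen for the messages above; the logging is hypothetical):

```go
// Sketch: report every failing check with its host, value, and threshold,
// rather than reacting only to the aggregate state.
for _, svc := range resp.GetServices() {
	for _, host := range svc.GetHosts() {
		for _, check := range host.GetChecks() {
			if check.GetState() == enumsspb.HEALTH_STATE_SERVING {
				continue
			}
			logger.Warn("health check failing",
				tag.NewStringTag("service", svc.GetService()),
				tag.NewStringTag("host", host.GetAddress()),
				tag.NewStringTag("check_type", check.GetCheckType()),
				tag.NewStringTag("message", check.GetMessage()),
				tag.Float64("value", check.GetValue()),
				tag.Float64("threshold", check.GetThreshold()),
			)
		}
	}
}
```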
Example responses
Healthy cluster (3 hosts, all serving)
{ "state": "HEALTH_STATE_SERVING", "services": [{ "service": "history", "state": "HEALTH_STATE_SERVING", "hosts": [ { "address": "10.0.1.5:7234", "state": "HEALTH_STATE_SERVING", "checks": [ {"check_type": "grpc_health", "state": "HEALTH_STATE_SERVING"}, {"check_type": "rpc_latency", "state": "HEALTH_STATE_SERVING", "value": 45.0, "threshold": 500.0}, {"check_type": "rpc_error_ratio", "state": "HEALTH_STATE_SERVING", "value": 0.01, "threshold": 0.1}, {"check_type": "persistence_latency", "state": "HEALTH_STATE_SERVING", "value": 12.0, "threshold": 500.0}, {"check_type": "persistence_error_ratio", "state": "HEALTH_STATE_SERVING", "value": 0.0, "threshold": 0.1} ] }, { "address": "10.0.1.6:7234", "state": "HEALTH_STATE_SERVING", "checks": [ {"check_type": "grpc_health", "state": "HEALTH_STATE_SERVING"}, {"check_type": "rpc_latency", "state": "HEALTH_STATE_SERVING", "value": 52.0, "threshold": 500.0}, {"check_type": "rpc_error_ratio", "state": "HEALTH_STATE_SERVING", "value": 0.0, "threshold": 0.1}, {"check_type": "persistence_latency", "state": "HEALTH_STATE_SERVING", "value": 18.0, "threshold": 500.0}, {"check_type": "persistence_error_ratio", "state": "HEALTH_STATE_SERVING", "value": 0.0, "threshold": 0.1} ] }, { "address": "10.0.1.7:7234", "state": "HEALTH_STATE_SERVING", "checks": [ {"check_type": "grpc_health", "state": "HEALTH_STATE_SERVING"}, {"check_type": "rpc_latency", "state": "HEALTH_STATE_SERVING", "value": 38.0, "threshold": 500.0}, {"check_type": "rpc_error_ratio", "state": "HEALTH_STATE_SERVING", "value": 0.02, "threshold": 0.1}, {"check_type": "persistence_latency", "state": "HEALTH_STATE_SERVING", "value": 15.0, "threshold": 500.0}, {"check_type": "persistence_error_ratio", "state": "HEALTH_STATE_SERVING", "value": 0.0, "threshold": 0.1} ] } ] }] }Degraded cluster — 1 host with high RPC latency (3 hosts, 1 failing, under threshold)
The failing host clearly shows which check triggered and the actual vs threshold values. Because only 1 of 3 hosts failed (33%, above the 25% failure threshold) but at least 2 failing hosts are required before the cell is marked unhealthy, the overall state remains `SERVING`.

```json
{
  "state": "HEALTH_STATE_SERVING",
  "services": [{
    "service": "history",
    "state": "HEALTH_STATE_SERVING",
    "hosts": [
      {
        "address": "10.0.1.5:7234",
        "state": "HEALTH_STATE_NOT_SERVING",
        "checks": [
          {"check_type": "grpc_health", "state": "HEALTH_STATE_SERVING"},
          {"check_type": "rpc_latency", "state": "HEALTH_STATE_NOT_SERVING", "value": 850.0, "threshold": 500.0, "message": "RPC latency 850.00ms exceeded 500.00ms threshold"},
          {"check_type": "rpc_error_ratio", "state": "HEALTH_STATE_SERVING", "value": 0.02, "threshold": 0.1},
          {"check_type": "persistence_latency", "state": "HEALTH_STATE_SERVING", "value": 120.0, "threshold": 500.0},
          {"check_type": "persistence_error_ratio", "state": "HEALTH_STATE_SERVING", "value": 0.0, "threshold": 0.1}
        ]
      },
      {
        "address": "10.0.1.6:7234",
        "state": "HEALTH_STATE_SERVING",
        "checks": [
          {"check_type": "grpc_health", "state": "HEALTH_STATE_SERVING"},
          {"check_type": "rpc_latency", "state": "HEALTH_STATE_SERVING", "value": 52.0, "threshold": 500.0},
          {"check_type": "rpc_error_ratio", "state": "HEALTH_STATE_SERVING", "value": 0.0, "threshold": 0.1},
          {"check_type": "persistence_latency", "state": "HEALTH_STATE_SERVING", "value": 18.0, "threshold": 500.0},
          {"check_type": "persistence_error_ratio", "state": "HEALTH_STATE_SERVING", "value": 0.0, "threshold": 0.1}
        ]
      },
      {
        "address": "10.0.1.7:7234",
        "state": "HEALTH_STATE_SERVING",
        "checks": [
          {"check_type": "grpc_health", "state": "HEALTH_STATE_SERVING"},
          {"check_type": "rpc_latency", "state": "HEALTH_STATE_SERVING", "value": 38.0, "threshold": 500.0},
          {"check_type": "rpc_error_ratio", "state": "HEALTH_STATE_SERVING", "value": 0.02, "threshold": 0.1},
          {"check_type": "persistence_latency", "state": "HEALTH_STATE_SERVING", "value": 15.0, "threshold": 500.0},
          {"check_type": "persistence_error_ratio", "state": "HEALTH_STATE_SERVING", "value": 0.0, "threshold": 0.1}
        ]
      }
    ]
  }]
}
```

Unhealthy cluster — hosts unreachable (6 hosts in membership, 3 unreachable via RPC)
When the frontend cannot reach a host via RPC, it creates a synthetic `host_availability` check with the error. The host appears in the response with `NOT_SERVING` and the RPC error message. With 3/6 hosts failing (50% > 25% threshold), the overall state is `NOT_SERVING`.

```json
{
  "state": "HEALTH_STATE_NOT_SERVING",
  "services": [{
    "service": "history",
    "state": "HEALTH_STATE_NOT_SERVING",
    "hosts": [
      {
        "address": "10.0.1.5:7234",
        "state": "HEALTH_STATE_SERVING",
        "checks": [
          {"check_type": "grpc_health", "state": "HEALTH_STATE_SERVING"},
          {"check_type": "rpc_latency", "state": "HEALTH_STATE_SERVING", "value": 45.0, "threshold": 500.0},
          {"check_type": "rpc_error_ratio", "state": "HEALTH_STATE_SERVING", "value": 0.01, "threshold": 0.1},
          {"check_type": "persistence_latency", "state": "HEALTH_STATE_SERVING", "value": 12.0, "threshold": 500.0},
          {"check_type": "persistence_error_ratio", "state": "HEALTH_STATE_SERVING", "value": 0.0, "threshold": 0.1}
        ]
      },
      {
        "address": "10.0.1.6:7234",
        "state": "HEALTH_STATE_SERVING",
        "checks": [
          {"check_type": "grpc_health", "state": "HEALTH_STATE_SERVING"},
          {"check_type": "rpc_latency", "state": "HEALTH_STATE_SERVING", "value": 52.0, "threshold": 500.0},
          {"check_type": "rpc_error_ratio", "state": "HEALTH_STATE_SERVING", "value": 0.0, "threshold": 0.1},
          {"check_type": "persistence_latency", "state": "HEALTH_STATE_SERVING", "value": 18.0, "threshold": 500.0},
          {"check_type": "persistence_error_ratio", "state": "HEALTH_STATE_SERVING", "value": 0.0, "threshold": 0.1}
        ]
      },
      {
        "address": "10.0.1.7:7234",
        "state": "HEALTH_STATE_SERVING",
        "checks": [
          {"check_type": "grpc_health", "state": "HEALTH_STATE_SERVING"},
          {"check_type": "rpc_latency", "state": "HEALTH_STATE_SERVING", "value": 38.0, "threshold": 500.0},
          {"check_type": "rpc_error_ratio", "state": "HEALTH_STATE_SERVING", "value": 0.02, "threshold": 0.1},
          {"check_type": "persistence_latency", "state": "HEALTH_STATE_SERVING", "value": 15.0, "threshold": 500.0},
          {"check_type": "persistence_error_ratio", "state": "HEALTH_STATE_SERVING", "value": 0.0, "threshold": 0.1}
        ]
      },
      {
        "address": "10.0.1.8:7234",
        "state": "HEALTH_STATE_NOT_SERVING",
        "checks": [
          {"check_type": "host_availability", "state": "HEALTH_STATE_NOT_SERVING", "message": "failed to reach host for health check: rpc error: code = Unavailable desc = connection refused"}
        ]
      },
      {
        "address": "10.0.1.9:7234",
        "state": "HEALTH_STATE_NOT_SERVING",
        "checks": [
          {"check_type": "host_availability", "state": "HEALTH_STATE_NOT_SERVING", "message": "failed to reach host for health check: rpc error: code = Unavailable desc = connection refused"}
        ]
      },
      {
        "address": "10.0.1.10:7234",
        "state": "HEALTH_STATE_NOT_SERVING",
        "checks": [
          {"check_type": "host_availability", "state": "HEALTH_STATE_NOT_SERVING", "message": "failed to reach host for health check: context deadline exceeded"}
        ]
      }
    ]
  }]
}
```

Host voluntarily draining — gRPC health declined (DECLINED_SERVING)
When a host's gRPC health server reports not serving (e.g. during graceful shutdown), the check returns `DECLINED_SERVING`. If enough hosts are in this state (exceeding the declined serving proportion threshold), the overall service state becomes `DECLINED_SERVING`.

```json
{
  "state": "HEALTH_STATE_DECLINED_SERVING",
  "services": [{
    "service": "history",
    "state": "HEALTH_STATE_DECLINED_SERVING",
    "hosts": [
      {
        "address": "10.0.1.5:7234",
        "state": "HEALTH_STATE_DECLINED_SERVING",
        "checks": [
          {"check_type": "grpc_health", "state": "HEALTH_STATE_DECLINED_SERVING", "message": "gRPC health server not serving"},
          {"check_type": "rpc_latency", "state": "HEALTH_STATE_SERVING", "value": 45.0, "threshold": 500.0},
          {"check_type": "rpc_error_ratio", "state": "HEALTH_STATE_SERVING", "value": 0.01, "threshold": 0.1},
          {"check_type": "persistence_latency", "state": "HEALTH_STATE_SERVING", "value": 12.0, "threshold": 500.0},
          {"check_type": "persistence_error_ratio", "state": "HEALTH_STATE_SERVING", "value": 0.0, "threshold": 0.1}
        ]
      },
      {
        "address": "10.0.1.6:7234",
        "state": "HEALTH_STATE_DECLINED_SERVING",
        "checks": [
          {"check_type": "grpc_health", "state": "HEALTH_STATE_DECLINED_SERVING", "message": "gRPC health server not serving"},
          {"check_type": "rpc_latency", "state": "HEALTH_STATE_SERVING", "value": 52.0, "threshold": 500.0},
          {"check_type": "rpc_error_ratio", "state": "HEALTH_STATE_SERVING", "value": 0.0, "threshold": 0.1},
          {"check_type": "persistence_latency", "state": "HEALTH_STATE_SERVING", "value": 18.0, "threshold": 500.0},
          {"check_type": "persistence_error_ratio", "state": "HEALTH_STATE_SERVING", "value": 0.0, "threshold": 0.1}
        ]
      },
      {
        "address": "10.0.1.7:7234",
        "state": "HEALTH_STATE_SERVING",
        "checks": [
          {"check_type": "grpc_health", "state": "HEALTH_STATE_SERVING"},
          {"check_type": "rpc_latency", "state": "HEALTH_STATE_SERVING", "value": 38.0, "threshold": 500.0},
          {"check_type": "rpc_error_ratio", "state": "HEALTH_STATE_SERVING", "value": 0.02, "threshold": 0.1},
          {"check_type": "persistence_latency", "state": "HEALTH_STATE_SERVING", "value": 15.0, "threshold": 500.0},
          {"check_type": "persistence_error_ratio", "state": "HEALTH_STATE_SERVING", "value": 0.0, "threshold": 0.1}
        ]
      }
    ]
  }]
}
```

No hosts in membership
When the membership resolver returns an empty host list, the response includes a service-level message but no hosts.
{ "state": "HEALTH_STATE_NOT_SERVING", "services": [{ "service": "history", "state": "HEALTH_STATE_NOT_SERVING", "message": "no available hosts in membership" }] }Membership resolver failure (INTERNAL_ERROR)
When the frontend can't resolve the membership ring at all (infrastructure failure), the response includes `INTERNAL_ERROR` with the resolver error.

```json
{
  "state": "HEALTH_STATE_INTERNAL_ERROR",
  "services": [{
    "service": "history",
    "state": "HEALTH_STATE_INTERNAL_ERROR",
    "message": "failed to get membership resolver: membership monitor not started"
  }]
}
```

Backward compatibility
- `DeepHealthCheckResponse.state` (field 1) unchanged in both history and admin protos
- New fields (`checks`, `services`) are additive (field 2) — old clients simply ignore them
- `GetState()` continues to work as before (see the sketch below)
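A sketch of why existing callers are unaffected (the request type name and the downstream reaction are illustrative):

```go
// Sketch: a pre-existing caller that only reads the top-level state keeps
// working, because field 1 is unchanged and the new fields are ignored by
// clients that don't know about them.
resp, err := adminClient.DeepHealthCheck(ctx, &adminservice.DeepHealthCheckRequest{})
if err != nil {
	return err
}
if resp.GetState() != enumsspb.HEALTH_STATE_SERVING {
	markCellUnhealthy() // hypothetical downstream reaction
}
```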
Related

- `HealthReport` + `CellHealthEvent` (consumer side, ready to use these fields)

Test plan
- `TestHealthCheckerSuite` tests pass (19 tests)
- New tests: `Test_Check_ServiceDetail_Populated`, `Test_Check_HostChecks_Propagated`, `Test_Check_GetResolver_Error` (INTERNAL_ERROR + message), `Test_Check_No_Available_Hosts` (message)
- `go build ./...` passes
- `check_type` constants are importable by consumers from `go.temporal.io/server/common/health`

🤖 Generated with Claude Code