
Add per-check diagnostics to DeepHealthCheck API #9350

Open

laniehei wants to merge 4 commits into main from lanie/deep-health-check-impl

Conversation

laniehei (Member) commented Feb 19, 2026

Summary

Extends the DeepHealthCheck API to return per-check diagnostic details alongside the existing HealthState enum. When fault detection triggers a cell failover, operators can now see exactly which health check failed and why — not just that the cell is unhealthy.

What changed

  • New proto package health/v1 with a HealthCheck message carrying check_type (string), state, value, threshold, and a human-readable message. HostHealthDetail and ServiceHealthDetail aggregate per-host and per-service results.
  • New enum value HEALTH_STATE_INTERNAL_ERROR — for infrastructure failures like membership resolver errors (previously returned UNSPECIFIED).
  • History handler now runs all 5 checks unconditionally (gRPC health, RPC latency, RPC error ratio, persistence latency, persistence error ratio) and returns each with actual values and thresholds. Previously, it returned early on the first failure.
  • Frontend health checker collects per-host HostHealthDetail (address, state, checks) and builds a ServiceHealthDetail with diagnostic messages for all paths — including resolver errors and empty membership.
  • AdminService passes ServiceHealthDetail through to callers.
  • check_type uses string constants (common/health/check_types.go) instead of a proto enum for extensibility — new check types can be added without proto changes, and the message field provides human-readable context with actual values (e.g. "RPC latency 850.00ms exceeded 500.00ms threshold"). A sketch of the constants follows this list.
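For orientation, here is a minimal sketch of what the string constants in common/health/check_types.go could look like. Only CheckTypeHostAvailability is visible in the diff below; the other constant names are assumptions, though their string values match the check_type fields in the example responses.

package health

// Check-type identifiers returned in HealthCheck.check_type. Keeping these as
// plain strings (rather than a proto enum) lets new check types ship without
// proto changes, and lets external consumers import the constants directly.
const (
	CheckTypeGRPCHealth            = "grpc_health"
	CheckTypeRPCLatency            = "rpc_latency"
	CheckTypeRPCErrorRatio         = "rpc_error_ratio"
	CheckTypePersistenceLatency    = "persistence_latency"
	CheckTypePersistenceErrorRatio = "persistence_error_ratio"
	CheckTypeHostAvailability      = "host_availability"
)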

How it works

The call chain is: AdminService.DeepHealthCheck() → HealthChecker.Check() → fan-out to all history hosts in membership → HistoryHandler.DeepHealthCheck() per host.

Each history host runs 5 independent checks and returns all results (a sketch of how a single check result is built follows the list):

  1. grpc_health — is the gRPC health server serving?
  2. rpc_latency — average RPC latency vs threshold
  3. rpc_error_ratio — RPC error rate vs threshold
  4. persistence_latency — DB latency vs threshold
  5. persistence_error_ratio — DB error rate vs threshold
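As an illustration of how one metric check can be turned into a per-check result, here is a hedged sketch. The buildLatencyCheck helper and the local healthCheck struct are stand-ins for the handler code and the generated health/v1 message; only the field names and the message format follow the PR description.

package main

import "fmt"

// healthCheck is a plain stand-in for the generated health/v1 HealthCheck
// message; the field names mirror the proto fields shown in the examples.
type healthCheck struct {
	CheckType string
	State     string
	Value     float64
	Threshold float64
	Message   string
}

// buildLatencyCheck (hypothetical) compares an observed latency against the
// configured threshold and returns one per-check result; the handler keeps
// running the remaining checks instead of bailing out on the first failure.
func buildLatencyCheck(checkType string, observedMs, thresholdMs float64) healthCheck {
	c := healthCheck{
		CheckType: checkType,
		State:     "HEALTH_STATE_SERVING",
		Value:     observedMs,
		Threshold: thresholdMs,
	}
	if observedMs > thresholdMs {
		c.State = "HEALTH_STATE_NOT_SERVING"
		c.Message = fmt.Sprintf("RPC latency %.2fms exceeded %.2fms threshold", observedMs, thresholdMs)
	}
	return c
}

func main() {
	fmt.Printf("%+v\n", buildLatencyCheck("rpc_latency", 850, 500))
}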

The frontend collects results from all hosts in membership, aggregates them, and returns the full breakdown.
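A rough sketch of that fan-out, assuming the channel-based collection hinted at by the receiveCh variable in the review snippets further down; the hostResult type and checkHost function are placeholders, not the actual implementation.

package main

import (
	"context"
	"fmt"
	"time"
)

// hostResult stands in for the per-host HostHealthDetail.
type hostResult struct {
	Address string
	State   string
}

// checkHost is a placeholder for the per-host DeepHealthCheck RPC; the real
// code translates an RPC failure into a synthetic host_availability check.
func checkHost(ctx context.Context, address string) hostResult {
	return hostResult{Address: address, State: "HEALTH_STATE_SERVING"}
}

func main() {
	hosts := []string{"10.0.1.5:7234", "10.0.1.6:7234", "10.0.1.7:7234"}
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	// Fan out one goroutine per host in membership.
	receiveCh := make(chan hostResult, len(hosts))
	for _, h := range hosts {
		go func(addr string) { receiveCh <- checkHost(ctx, addr) }(h)
	}

	// Collect one result per host, then aggregate into the service-level state.
	results := make([]hostResult, 0, len(hosts))
	for range hosts {
		results = append(results, <-receiveCh)
	}
	fmt.Println(results)
}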

Example responses

Healthy cluster (3 hosts, all serving)

{
  "state": "HEALTH_STATE_SERVING",
  "services": [{
    "service": "history",
    "state": "HEALTH_STATE_SERVING",
    "hosts": [
      {
        "address": "10.0.1.5:7234",
        "state": "HEALTH_STATE_SERVING",
        "checks": [
          {"check_type": "grpc_health", "state": "HEALTH_STATE_SERVING"},
          {"check_type": "rpc_latency", "state": "HEALTH_STATE_SERVING", "value": 45.0, "threshold": 500.0},
          {"check_type": "rpc_error_ratio", "state": "HEALTH_STATE_SERVING", "value": 0.01, "threshold": 0.1},
          {"check_type": "persistence_latency", "state": "HEALTH_STATE_SERVING", "value": 12.0, "threshold": 500.0},
          {"check_type": "persistence_error_ratio", "state": "HEALTH_STATE_SERVING", "value": 0.0, "threshold": 0.1}
        ]
      },
      {
        "address": "10.0.1.6:7234",
        "state": "HEALTH_STATE_SERVING",
        "checks": [
          {"check_type": "grpc_health", "state": "HEALTH_STATE_SERVING"},
          {"check_type": "rpc_latency", "state": "HEALTH_STATE_SERVING", "value": 52.0, "threshold": 500.0},
          {"check_type": "rpc_error_ratio", "state": "HEALTH_STATE_SERVING", "value": 0.0, "threshold": 0.1},
          {"check_type": "persistence_latency", "state": "HEALTH_STATE_SERVING", "value": 18.0, "threshold": 500.0},
          {"check_type": "persistence_error_ratio", "state": "HEALTH_STATE_SERVING", "value": 0.0, "threshold": 0.1}
        ]
      },
      {
        "address": "10.0.1.7:7234",
        "state": "HEALTH_STATE_SERVING",
        "checks": [
          {"check_type": "grpc_health", "state": "HEALTH_STATE_SERVING"},
          {"check_type": "rpc_latency", "state": "HEALTH_STATE_SERVING", "value": 38.0, "threshold": 500.0},
          {"check_type": "rpc_error_ratio", "state": "HEALTH_STATE_SERVING", "value": 0.02, "threshold": 0.1},
          {"check_type": "persistence_latency", "state": "HEALTH_STATE_SERVING", "value": 15.0, "threshold": 500.0},
          {"check_type": "persistence_error_ratio", "state": "HEALTH_STATE_SERVING", "value": 0.0, "threshold": 0.1}
        ]
      }
    ]
  }]
}

Degraded cluster — 1 host with high RPC latency (3 hosts, 1 failing, below the host-failure threshold)

The failing host clearly shows which check triggered and the actual vs. threshold values. Only 1 of 3 hosts failed (33%); although that exceeds the 25% failure threshold, fewer than the required minimum of 2 hosts failed, so the overall state remains SERVING.

{
  "state": "HEALTH_STATE_SERVING",
  "services": [{
    "service": "history",
    "state": "HEALTH_STATE_SERVING",
    "hosts": [
      {
        "address": "10.0.1.5:7234",
        "state": "HEALTH_STATE_NOT_SERVING",
        "checks": [
          {"check_type": "grpc_health", "state": "HEALTH_STATE_SERVING"},
          {"check_type": "rpc_latency", "state": "HEALTH_STATE_NOT_SERVING", "value": 850.0, "threshold": 500.0, "message": "RPC latency 850.00ms exceeded 500.00ms threshold"},
          {"check_type": "rpc_error_ratio", "state": "HEALTH_STATE_SERVING", "value": 0.02, "threshold": 0.1},
          {"check_type": "persistence_latency", "state": "HEALTH_STATE_SERVING", "value": 120.0, "threshold": 500.0},
          {"check_type": "persistence_error_ratio", "state": "HEALTH_STATE_SERVING", "value": 0.0, "threshold": 0.1}
        ]
      },
      {
        "address": "10.0.1.6:7234",
        "state": "HEALTH_STATE_SERVING",
        "checks": [
          {"check_type": "grpc_health", "state": "HEALTH_STATE_SERVING"},
          {"check_type": "rpc_latency", "state": "HEALTH_STATE_SERVING", "value": 52.0, "threshold": 500.0},
          {"check_type": "rpc_error_ratio", "state": "HEALTH_STATE_SERVING", "value": 0.0, "threshold": 0.1},
          {"check_type": "persistence_latency", "state": "HEALTH_STATE_SERVING", "value": 18.0, "threshold": 500.0},
          {"check_type": "persistence_error_ratio", "state": "HEALTH_STATE_SERVING", "value": 0.0, "threshold": 0.1}
        ]
      },
      {
        "address": "10.0.1.7:7234",
        "state": "HEALTH_STATE_SERVING",
        "checks": [
          {"check_type": "grpc_health", "state": "HEALTH_STATE_SERVING"},
          {"check_type": "rpc_latency", "state": "HEALTH_STATE_SERVING", "value": 38.0, "threshold": 500.0},
          {"check_type": "rpc_error_ratio", "state": "HEALTH_STATE_SERVING", "value": 0.02, "threshold": 0.1},
          {"check_type": "persistence_latency", "state": "HEALTH_STATE_SERVING", "value": 15.0, "threshold": 500.0},
          {"check_type": "persistence_error_ratio", "state": "HEALTH_STATE_SERVING", "value": 0.0, "threshold": 0.1}
        ]
      }
    ]
  }]
}

Unhealthy cluster — hosts unreachable (6 hosts in membership, 3 unreachable via RPC)

When the frontend cannot reach a host via RPC, it creates a synthetic host_availability check with the error. The host appears in the response with NOT_SERVING and the RPC error message. With 3/6 hosts failing (50% > 25% threshold), the overall state is NOT_SERVING.

{
  "state": "HEALTH_STATE_NOT_SERVING",
  "services": [{
    "service": "history",
    "state": "HEALTH_STATE_NOT_SERVING",
    "hosts": [
      {
        "address": "10.0.1.5:7234",
        "state": "HEALTH_STATE_SERVING",
        "checks": [
          {"check_type": "grpc_health", "state": "HEALTH_STATE_SERVING"},
          {"check_type": "rpc_latency", "state": "HEALTH_STATE_SERVING", "value": 45.0, "threshold": 500.0},
          {"check_type": "rpc_error_ratio", "state": "HEALTH_STATE_SERVING", "value": 0.01, "threshold": 0.1},
          {"check_type": "persistence_latency", "state": "HEALTH_STATE_SERVING", "value": 12.0, "threshold": 500.0},
          {"check_type": "persistence_error_ratio", "state": "HEALTH_STATE_SERVING", "value": 0.0, "threshold": 0.1}
        ]
      },
      {
        "address": "10.0.1.6:7234",
        "state": "HEALTH_STATE_SERVING",
        "checks": [
          {"check_type": "grpc_health", "state": "HEALTH_STATE_SERVING"},
          {"check_type": "rpc_latency", "state": "HEALTH_STATE_SERVING", "value": 52.0, "threshold": 500.0},
          {"check_type": "rpc_error_ratio", "state": "HEALTH_STATE_SERVING", "value": 0.0, "threshold": 0.1},
          {"check_type": "persistence_latency", "state": "HEALTH_STATE_SERVING", "value": 18.0, "threshold": 500.0},
          {"check_type": "persistence_error_ratio", "state": "HEALTH_STATE_SERVING", "value": 0.0, "threshold": 0.1}
        ]
      },
      {
        "address": "10.0.1.7:7234",
        "state": "HEALTH_STATE_SERVING",
        "checks": [
          {"check_type": "grpc_health", "state": "HEALTH_STATE_SERVING"},
          {"check_type": "rpc_latency", "state": "HEALTH_STATE_SERVING", "value": 38.0, "threshold": 500.0},
          {"check_type": "rpc_error_ratio", "state": "HEALTH_STATE_SERVING", "value": 0.02, "threshold": 0.1},
          {"check_type": "persistence_latency", "state": "HEALTH_STATE_SERVING", "value": 15.0, "threshold": 500.0},
          {"check_type": "persistence_error_ratio", "state": "HEALTH_STATE_SERVING", "value": 0.0, "threshold": 0.1}
        ]
      },
      {
        "address": "10.0.1.8:7234",
        "state": "HEALTH_STATE_NOT_SERVING",
        "checks": [
          {"check_type": "host_availability", "state": "HEALTH_STATE_NOT_SERVING", "message": "failed to reach host for health check: rpc error: code = Unavailable desc = connection refused"}
        ]
      },
      {
        "address": "10.0.1.9:7234",
        "state": "HEALTH_STATE_NOT_SERVING",
        "checks": [
          {"check_type": "host_availability", "state": "HEALTH_STATE_NOT_SERVING", "message": "failed to reach host for health check: rpc error: code = Unavailable desc = connection refused"}
        ]
      },
      {
        "address": "10.0.1.10:7234",
        "state": "HEALTH_STATE_NOT_SERVING",
        "checks": [
          {"check_type": "host_availability", "state": "HEALTH_STATE_NOT_SERVING", "message": "failed to reach host for health check: context deadline exceeded"}
        ]
      }
    ]
  }]
}
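To make the aggregation arithmetic behind the last two examples concrete, here is a hedged sketch of the service-level decision. It is modeled on the failedHostCountProportion logic quoted in the review below; the minimum-failed-hosts guard and its value of 2 come from the description above and are assumptions, not confirmed implementation details.

package main

import "fmt"

// aggregateState is illustrative only: the failure proportion must exceed the
// configured threshold and at least minFailedHosts hosts must have failed
// before the service is reported NOT_SERVING.
func aggregateState(failedHosts, totalHosts int, failureThreshold float64, minFailedHosts int) string {
	proportion := float64(failedHosts) / float64(totalHosts)
	if proportion > failureThreshold && failedHosts >= minFailedHosts {
		return "HEALTH_STATE_NOT_SERVING"
	}
	return "HEALTH_STATE_SERVING"
}

func main() {
	// 1 of 3 hosts failing: 33% > 25%, but below the 2-host minimum -> SERVING.
	fmt.Println(aggregateState(1, 3, 0.25, 2))
	// 3 of 6 hosts failing: 50% > 25% and at least 2 failed hosts -> NOT_SERVING.
	fmt.Println(aggregateState(3, 6, 0.25, 2))
}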

Host voluntarily draining — gRPC health declined (DECLINED_SERVING)

When a host's gRPC health server reports not serving (e.g. during graceful shutdown), the check returns DECLINED_SERVING. If enough hosts are in this state (exceeding the declined serving proportion threshold), the overall service state becomes DECLINED_SERVING.

{
  "state": "HEALTH_STATE_DECLINED_SERVING",
  "services": [{
    "service": "history",
    "state": "HEALTH_STATE_DECLINED_SERVING",
    "hosts": [
      {
        "address": "10.0.1.5:7234",
        "state": "HEALTH_STATE_DECLINED_SERVING",
        "checks": [
          {"check_type": "grpc_health", "state": "HEALTH_STATE_DECLINED_SERVING", "message": "gRPC health server not serving"},
          {"check_type": "rpc_latency", "state": "HEALTH_STATE_SERVING", "value": 45.0, "threshold": 500.0},
          {"check_type": "rpc_error_ratio", "state": "HEALTH_STATE_SERVING", "value": 0.01, "threshold": 0.1},
          {"check_type": "persistence_latency", "state": "HEALTH_STATE_SERVING", "value": 12.0, "threshold": 500.0},
          {"check_type": "persistence_error_ratio", "state": "HEALTH_STATE_SERVING", "value": 0.0, "threshold": 0.1}
        ]
      },
      {
        "address": "10.0.1.6:7234",
        "state": "HEALTH_STATE_DECLINED_SERVING",
        "checks": [
          {"check_type": "grpc_health", "state": "HEALTH_STATE_DECLINED_SERVING", "message": "gRPC health server not serving"},
          {"check_type": "rpc_latency", "state": "HEALTH_STATE_SERVING", "value": 52.0, "threshold": 500.0},
          {"check_type": "rpc_error_ratio", "state": "HEALTH_STATE_SERVING", "value": 0.0, "threshold": 0.1},
          {"check_type": "persistence_latency", "state": "HEALTH_STATE_SERVING", "value": 18.0, "threshold": 500.0},
          {"check_type": "persistence_error_ratio", "state": "HEALTH_STATE_SERVING", "value": 0.0, "threshold": 0.1}
        ]
      },
      {
        "address": "10.0.1.7:7234",
        "state": "HEALTH_STATE_SERVING",
        "checks": [
          {"check_type": "grpc_health", "state": "HEALTH_STATE_SERVING"},
          {"check_type": "rpc_latency", "state": "HEALTH_STATE_SERVING", "value": 38.0, "threshold": 500.0},
          {"check_type": "rpc_error_ratio", "state": "HEALTH_STATE_SERVING", "value": 0.02, "threshold": 0.1},
          {"check_type": "persistence_latency", "state": "HEALTH_STATE_SERVING", "value": 15.0, "threshold": 500.0},
          {"check_type": "persistence_error_ratio", "state": "HEALTH_STATE_SERVING", "value": 0.0, "threshold": 0.1}
        ]
      }
    ]
  }]
}
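A minimal sketch of the mapping this example implies, assuming the handler consults the standard grpc_health_v1 serving status; the mapping function is illustrative and not the handler's actual code.

package main

import (
	"fmt"

	healthgrpc "google.golang.org/grpc/health/grpc_health_v1"
)

// mapGRPCHealth (illustrative) translates the gRPC health server's serving
// status into the states used by this API: a draining host that reports
// NOT_SERVING is surfaced as DECLINED_SERVING rather than a hard failure.
func mapGRPCHealth(status healthgrpc.HealthCheckResponse_ServingStatus) string {
	switch status {
	case healthgrpc.HealthCheckResponse_SERVING:
		return "HEALTH_STATE_SERVING"
	case healthgrpc.HealthCheckResponse_NOT_SERVING:
		return "HEALTH_STATE_DECLINED_SERVING"
	default:
		return "HEALTH_STATE_UNSPECIFIED"
	}
}

func main() {
	fmt.Println(mapGRPCHealth(healthgrpc.HealthCheckResponse_NOT_SERVING))
}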

No hosts in membership

When the membership resolver returns an empty host list, the response includes a service-level message but no hosts.

{
  "state": "HEALTH_STATE_NOT_SERVING",
  "services": [{
    "service": "history",
    "state": "HEALTH_STATE_NOT_SERVING",
    "message": "no available hosts in membership"
  }]
}

Membership resolver failure (INTERNAL_ERROR)

When the frontend can't resolve the membership ring at all (infrastructure failure), the response includes INTERNAL_ERROR with the resolver error.

{
  "state": "HEALTH_STATE_INTERNAL_ERROR",
  "services": [{
    "service": "history",
    "state": "HEALTH_STATE_INTERNAL_ERROR",
    "message": "failed to get membership resolver: membership monitor not started"
  }]
}
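The two service-level error paths above can be summarized in a short sketch; serviceHealthDetail is a stand-in for the generated message and buildServiceDetail is a hypothetical helper, shown only to contrast the INTERNAL_ERROR and empty-membership cases.

package main

import (
	"errors"
	"fmt"
)

// serviceHealthDetail is a plain stand-in for the generated ServiceHealthDetail.
type serviceHealthDetail struct {
	Service string
	State   string
	Message string
}

// buildServiceDetail (hypothetical) distinguishes the two error paths:
// resolver failures are infrastructure errors (INTERNAL_ERROR), while an
// empty membership ring is reported as NOT_SERVING with a message.
func buildServiceDetail(resolverErr error, hostCount int) serviceHealthDetail {
	d := serviceHealthDetail{Service: "history"}
	switch {
	case resolverErr != nil:
		d.State = "HEALTH_STATE_INTERNAL_ERROR"
		d.Message = fmt.Sprintf("failed to get membership resolver: %v", resolverErr)
	case hostCount == 0:
		d.State = "HEALTH_STATE_NOT_SERVING"
		d.Message = "no available hosts in membership"
	default:
		d.State = "HEALTH_STATE_SERVING"
	}
	return d
}

func main() {
	fmt.Println(buildServiceDetail(errors.New("membership monitor not started"), 0))
	fmt.Println(buildServiceDetail(nil, 0))
}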

Backward compatibility

  • DeepHealthCheckResponse.state (field 1) unchanged in both history and admin protos
  • New fields (checks, services) are additive (field 2) — old clients simply ignore them
  • GetState() continues to work as before (see the usage sketch below)
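A hedged usage sketch of the backward-compatible path, assuming the generated AdminServiceClient and enums packages keep their usual import paths; exact package names may differ.

package compat

import (
	"context"

	adminservice "go.temporal.io/server/api/adminservice/v1"
	enumsspb "go.temporal.io/server/api/enums/v1"
)

// isCellServing shows what an existing caller keeps doing after this PR: it
// reads only GetState() and never touches the new services field, which older
// generated code simply ignores.
func isCellServing(ctx context.Context, client adminservice.AdminServiceClient) (bool, error) {
	resp, err := client.DeepHealthCheck(ctx, &adminservice.DeepHealthCheckRequest{})
	if err != nil {
		return false, err
	}
	return resp.GetState() == enumsspb.HEALTH_STATE_SERVING, nil
}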

Related

Test plan

  • All existing TestHealthCheckerSuite tests pass (19 tests)
  • New tests: Test_Check_ServiceDetail_Populated, Test_Check_HostChecks_Propagated, Test_Check_GetResolver_Error (INTERNAL_ERROR + message), Test_Check_No_Available_Hosts (message)
  • Full go build ./... passes
  • Verify saas-control-plane can import go.temporal.io/server/common/health constants
  • Integration test with actual history service

🤖 Generated with Claude Code

Extend the DeepHealthCheck API to return detailed per-check diagnostic
information alongside the existing HealthState enum. This enables
upstream consumers (e.g. saas-control-plane fault detection) to
understand which specific health check failed and why, rather than
only knowing a cell is unhealthy.

Changes:
- New proto: health/v1/message.proto with HealthCheck, HostHealthDetail,
  and ServiceHealthDetail messages
- New enum value: HEALTH_STATE_INTERNAL_ERROR for resolver failures
- History handler runs all 5 checks (gRPC health, RPC latency/errors,
  persistence latency/errors) and returns per-check results with
  check_type, value, threshold, and human-readable message
- Frontend aggregator collects per-host details and propagates them
  through ServiceHealthDetail
- AdminService passes through ServiceHealthDetail to callers
- check_type uses string constants (not proto enum) for extensibility;
  constants live in common/health/ for import by consumers

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
laniehei requested review from a team as code owners on February 19, 2026 01:15
laniehei and others added 3 commits February 18, 2026 17:25
Previously, a gRPC health check error caused an immediate return,
skipping all 4 remaining checks. Now the error is captured as a
NOT_SERVING check result with the error message, and all checks
continue to run so the full diagnostic picture is always returned.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When the gRPC health check returns an error, return early with only
that check result rather than running the remaining metric checks.
The caller still gets the diagnostic info about what failed (check_type
+ message) instead of a bare error.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Rename healthpb -> healthspb per importas config
- Add default clause to switch statement (revive)
- Use s.Require().NoError() for error assertions (testifylint)
- Use s.InDelta() for float comparisons (testifylint)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Comment on lines 91 to +107
	if err != nil {
		h.logger.Warn("failed to ping deep health check", tag.Error(err), tag.ServerName(string(h.serviceName)))
		// Synthetic check: the host health check RPC failed, so we create a
		// HealthCheck entry to propagate the error message upstream. The State
		// here mirrors DeepHealthCheckResponse.State since there's only one check.
		resp = &historyservice.DeepHealthCheckResponse{
			State: enumsspb.HEALTH_STATE_NOT_SERVING,
			Checks: []*healthspb.HealthCheck{
				{
					CheckType: health.CheckTypeHostAvailability,
					State:     enumsspb.HEALTH_STATE_NOT_SERVING,
					Message:   fmt.Sprintf("failed to reach host for health check: %v", err),
				},
			},
		}
	}
	if resp == nil {

Might be a bug here. Is it guaranteed that healthcheckfn can't return err != nil and resp==nil? In that case, we will blank out err by overriding resp

healthState := <-receiveCh
switch healthState {
case enumsspb.HEALTH_STATE_NOT_SERVING, enumsspb.HEALTH_STATE_UNSPECIFIED:
result := <-receiveCh

Minor: there's a panic vulnerability here; if one of the goroutines we're waiting on panics, this will hang forever
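For illustration, one way to bound that wait (names are placeholders, not the PR's code): select on the result channel and the request context so a goroutine that never sends cannot block the collector forever.

package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// collectWithDeadline waits for one result per host but gives up when the
// context expires, so a goroutine that panicked (and will never send) does
// not hang the health check indefinitely. Purely illustrative.
func collectWithDeadline(ctx context.Context, receiveCh <-chan string, hosts int) ([]string, error) {
	results := make([]string, 0, hosts)
	for i := 0; i < hosts; i++ {
		select {
		case r := <-receiveCh:
			results = append(results, r)
		case <-ctx.Done():
			return results, errors.New("timed out waiting for host health results")
		}
	}
	return results, nil
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 100*time.Millisecond)
	defer cancel()
	ch := make(chan string, 1)
	ch <- "HEALTH_STATE_SERVING"
	fmt.Println(collectWithDeadline(ctx, ch, 2)) // second receive times out
}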

	} else {
		failedHostCountProportion := failedHostCount / float64(len(hosts))
		if failedHostCountProportion+hostDeclinedServingProportion > h.hostFailurePercentage() {
			h.logger.Warn("health check exceeded host failure percentage threshold", tag.Float64("host failure percentage threshold", h.hostFailurePercentage()), tag.Float64("host failure percentage", failedHostCountProportion), tag.Float64("host declined serving percentage", hostDeclinedServingProportion))

It would be nice to add an example failure from the failed hosts in this log (We can get it from ensureMinimumProportionOfHosts)

