
Add per-check diagnostics to DeepHealthCheck API #9350

Open

laniehei wants to merge 4 commits into main from lanie/deep-health-check-impl

Conversation

laniehei (Member) commented Feb 19, 2026

Summary

Extends the DeepHealthCheck API to return per-check diagnostic details alongside the existing HealthState enum. When fault detection triggers a cell failover, operators can now see exactly which health check failed and why — not just that the cell is unhealthy.

What changed

  • New proto package health/v1 with a HealthCheck message carrying check_type (string), state, value, threshold, and a human-readable message. HostHealthDetail and ServiceHealthDetail aggregate per-host and per-service results.
  • New enum value HEALTH_STATE_INTERNAL_ERROR — for infrastructure failures like membership resolver errors (previously returned UNSPECIFIED).
  • History handler now runs all 5 checks unconditionally (gRPC health, RPC latency, RPC error ratio, persistence latency, persistence error ratio) and returns each with actual values and thresholds. Previously, it returned early on the first failure.
  • Frontend health checker collects per-host HostHealthDetail (address, state, checks) and builds a ServiceHealthDetail with diagnostic messages for all paths — including resolver errors and empty membership.
  • AdminService passes ServiceHealthDetail through to callers.
  • check_type uses string constants (common/health/check_types.go) instead of a proto enum for extensibility — new check types can be added without proto changes, and the message field provides human-readable context with actual values (e.g. "RPC latency 850.00ms exceeded 500.00ms threshold"). A sketch of the constants follows this list.
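For orientation, here is a minimal sketch of what the string constants in common/health/check_types.go could look like. Only CheckTypeHostAvailability is visible in the diff below; the other constant names are assumptions, though their string values match the check_type fields in the example responses.

package health

// Check-type identifiers returned in HealthCheck.check_type. Keeping these as
// plain strings (rather than a proto enum) lets new check types ship without
// proto changes, and lets external consumers import the constants directly.
const (
	CheckTypeGRPCHealth            = "grpc_health"
	CheckTypeRPCLatency            = "rpc_latency"
	CheckTypeRPCErrorRatio         = "rpc_error_ratio"
	CheckTypePersistenceLatency    = "persistence_latency"
	CheckTypePersistenceErrorRatio = "persistence_error_ratio"
	CheckTypeHostAvailability      = "host_availability"
)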

How it works

The call chain is: AdminService.DeepHealthCheck() → HealthChecker.Check() → fan-out to all history hosts in membership → HistoryHandler.DeepHealthCheck() per host.

Each history host runs 5 independent checks and returns all results (a sketch of how a single check result is built follows the list):

  1. grpc_health — is the gRPC health server serving?
  2. rpc_latency — average RPC latency vs threshold
  3. rpc_error_ratio — RPC error rate vs threshold
  4. persistence_latency — DB latency vs threshold
  5. persistence_error_ratio — DB error rate vs threshold
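As an illustration of how one metric check can be turned into a per-check result, here is a hedged sketch. The buildLatencyCheck helper and the local healthCheck struct are stand-ins for the handler code and the generated health/v1 message; only the field names and the message format follow the PR description.

package main

import "fmt"

// healthCheck is a plain stand-in for the generated health/v1 HealthCheck
// message; the field names mirror the proto fields shown in the examples.
type healthCheck struct {
	CheckType string
	State     string
	Value     float64
	Threshold float64
	Message   string
}

// buildLatencyCheck (hypothetical) compares an observed latency against the
// configured threshold and returns one per-check result; the handler keeps
// running the remaining checks instead of bailing out on the first failure.
func buildLatencyCheck(checkType string, observedMs, thresholdMs float64) healthCheck {
	c := healthCheck{
		CheckType: checkType,
		State:     "HEALTH_STATE_SERVING",
		Value:     observedMs,
		Threshold: thresholdMs,
	}
	if observedMs > thresholdMs {
		c.State = "HEALTH_STATE_NOT_SERVING"
		c.Message = fmt.Sprintf("RPC latency %.2fms exceeded %.2fms threshold", observedMs, thresholdMs)
	}
	return c
}

func main() {
	fmt.Printf("%+v\n", buildLatencyCheck("rpc_latency", 850, 500))
}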

The frontend collects results from all hosts in membership, aggregates them, and returns the full breakdown.
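A rough sketch of that fan-out, assuming the channel-based collection hinted at by the receiveCh variable in the review snippets further down; the hostResult type and checkHost function are placeholders, not the actual implementation.

package main

import (
	"context"
	"fmt"
	"time"
)

// hostResult stands in for the per-host HostHealthDetail.
type hostResult struct {
	Address string
	State   string
}

// checkHost is a placeholder for the per-host DeepHealthCheck RPC; the real
// code translates an RPC failure into a synthetic host_availability check.
func checkHost(ctx context.Context, address string) hostResult {
	return hostResult{Address: address, State: "HEALTH_STATE_SERVING"}
}

func main() {
	hosts := []string{"10.0.1.5:7234", "10.0.1.6:7234", "10.0.1.7:7234"}
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	// Fan out one goroutine per host in membership.
	receiveCh := make(chan hostResult, len(hosts))
	for _, h := range hosts {
		go func(addr string) { receiveCh <- checkHost(ctx, addr) }(h)
	}

	// Collect one result per host, then aggregate into the service-level state.
	results := make([]hostResult, 0, len(hosts))
	for range hosts {
		results = append(results, <-receiveCh)
	}
	fmt.Println(results)
}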

Example responses

Healthy cluster (3 hosts, all serving)

{
  "state": "HEALTH_STATE_SERVING",
  "services": [{
    "service": "history",
    "state": "HEALTH_STATE_SERVING",
    "hosts": [
      {
        "address": "10.0.1.5:7234",
        "state": "HEALTH_STATE_SERVING",
        "checks": [
          {"check_type": "grpc_health", "state": "HEALTH_STATE_SERVING"},
          {"check_type": "rpc_latency", "state": "HEALTH_STATE_SERVING", "value": 45.0, "threshold": 500.0},
          {"check_type": "rpc_error_ratio", "state": "HEALTH_STATE_SERVING", "value": 0.01, "threshold": 0.1},
          {"check_type": "persistence_latency", "state": "HEALTH_STATE_SERVING", "value": 12.0, "threshold": 500.0},
          {"check_type": "persistence_error_ratio", "state": "HEALTH_STATE_SERVING", "value": 0.0, "threshold": 0.1}
        ]
      },
      {
        "address": "10.0.1.6:7234",
        "state": "HEALTH_STATE_SERVING",
        "checks": [
          {"check_type": "grpc_health", "state": "HEALTH_STATE_SERVING"},
          {"check_type": "rpc_latency", "state": "HEALTH_STATE_SERVING", "value": 52.0, "threshold": 500.0},
          {"check_type": "rpc_error_ratio", "state": "HEALTH_STATE_SERVING", "value": 0.0, "threshold": 0.1},
          {"check_type": "persistence_latency", "state": "HEALTH_STATE_SERVING", "value": 18.0, "threshold": 500.0},
          {"check_type": "persistence_error_ratio", "state": "HEALTH_STATE_SERVING", "value": 0.0, "threshold": 0.1}
        ]
      },
      {
        "address": "10.0.1.7:7234",
        "state": "HEALTH_STATE_SERVING",
        "checks": [
          {"check_type": "grpc_health", "state": "HEALTH_STATE_SERVING"},
          {"check_type": "rpc_latency", "state": "HEALTH_STATE_SERVING", "value": 38.0, "threshold": 500.0},
          {"check_type": "rpc_error_ratio", "state": "HEALTH_STATE_SERVING", "value": 0.02, "threshold": 0.1},
          {"check_type": "persistence_latency", "state": "HEALTH_STATE_SERVING", "value": 15.0, "threshold": 500.0},
          {"check_type": "persistence_error_ratio", "state": "HEALTH_STATE_SERVING", "value": 0.0, "threshold": 0.1}
        ]
      }
    ]
  }]
}

Degraded cluster — 1 host with high RPC latency (3 hosts, 1 failing, below the host-failure threshold)

The failing host clearly shows which check triggered and the actual vs. threshold values. Only 1 of 3 hosts failed (33%); although that exceeds the 25% failure threshold, fewer than the required minimum of 2 hosts failed, so the overall state remains SERVING.

{
  "state": "HEALTH_STATE_SERVING",
  "services": [{
    "service": "history",
    "state": "HEALTH_STATE_SERVING",
    "hosts": [
      {
        "address": "10.0.1.5:7234",
        "state": "HEALTH_STATE_NOT_SERVING",
        "checks": [
          {"check_type": "grpc_health", "state": "HEALTH_STATE_SERVING"},
          {"check_type": "rpc_latency", "state": "HEALTH_STATE_NOT_SERVING", "value": 850.0, "threshold": 500.0, "message": "RPC latency 850.00ms exceeded 500.00ms threshold"},
          {"check_type": "rpc_error_ratio", "state": "HEALTH_STATE_SERVING", "value": 0.02, "threshold": 0.1},
          {"check_type": "persistence_latency", "state": "HEALTH_STATE_SERVING", "value": 120.0, "threshold": 500.0},
          {"check_type": "persistence_error_ratio", "state": "HEALTH_STATE_SERVING", "value": 0.0, "threshold": 0.1}
        ]
      },
      {
        "address": "10.0.1.6:7234",
        "state": "HEALTH_STATE_SERVING",
        "checks": [
          {"check_type": "grpc_health", "state": "HEALTH_STATE_SERVING"},
          {"check_type": "rpc_latency", "state": "HEALTH_STATE_SERVING", "value": 52.0, "threshold": 500.0},
          {"check_type": "rpc_error_ratio", "state": "HEALTH_STATE_SERVING", "value": 0.0, "threshold": 0.1},
          {"check_type": "persistence_latency", "state": "HEALTH_STATE_SERVING", "value": 18.0, "threshold": 500.0},
          {"check_type": "persistence_error_ratio", "state": "HEALTH_STATE_SERVING", "value": 0.0, "threshold": 0.1}
        ]
      },
      {
        "address": "10.0.1.7:7234",
        "state": "HEALTH_STATE_SERVING",
        "checks": [
          {"check_type": "grpc_health", "state": "HEALTH_STATE_SERVING"},
          {"check_type": "rpc_latency", "state": "HEALTH_STATE_SERVING", "value": 38.0, "threshold": 500.0},
          {"check_type": "rpc_error_ratio", "state": "HEALTH_STATE_SERVING", "value": 0.02, "threshold": 0.1},
          {"check_type": "persistence_latency", "state": "HEALTH_STATE_SERVING", "value": 15.0, "threshold": 500.0},
          {"check_type": "persistence_error_ratio", "state": "HEALTH_STATE_SERVING", "value": 0.0, "threshold": 0.1}
        ]
      }
    ]
  }]
}

Unhealthy cluster — hosts unreachable (6 hosts in membership, 3 unreachable via RPC)

When the frontend cannot reach a host via RPC, it creates a synthetic host_availability check with the error. The host appears in the response with NOT_SERVING and the RPC error message. With 3/6 hosts failing (50% > 25% threshold), the overall state is NOT_SERVING.

{
  "state": "HEALTH_STATE_NOT_SERVING",
  "services": [{
    "service": "history",
    "state": "HEALTH_STATE_NOT_SERVING",
    "hosts": [
      {
        "address": "10.0.1.5:7234",
        "state": "HEALTH_STATE_SERVING",
        "checks": [
          {"check_type": "grpc_health", "state": "HEALTH_STATE_SERVING"},
          {"check_type": "rpc_latency", "state": "HEALTH_STATE_SERVING", "value": 45.0, "threshold": 500.0},
          {"check_type": "rpc_error_ratio", "state": "HEALTH_STATE_SERVING", "value": 0.01, "threshold": 0.1},
          {"check_type": "persistence_latency", "state": "HEALTH_STATE_SERVING", "value": 12.0, "threshold": 500.0},
          {"check_type": "persistence_error_ratio", "state": "HEALTH_STATE_SERVING", "value": 0.0, "threshold": 0.1}
        ]
      },
      {
        "address": "10.0.1.6:7234",
        "state": "HEALTH_STATE_SERVING",
        "checks": [
          {"check_type": "grpc_health", "state": "HEALTH_STATE_SERVING"},
          {"check_type": "rpc_latency", "state": "HEALTH_STATE_SERVING", "value": 52.0, "threshold": 500.0},
          {"check_type": "rpc_error_ratio", "state": "HEALTH_STATE_SERVING", "value": 0.0, "threshold": 0.1},
          {"check_type": "persistence_latency", "state": "HEALTH_STATE_SERVING", "value": 18.0, "threshold": 500.0},
          {"check_type": "persistence_error_ratio", "state": "HEALTH_STATE_SERVING", "value": 0.0, "threshold": 0.1}
        ]
      },
      {
        "address": "10.0.1.7:7234",
        "state": "HEALTH_STATE_SERVING",
        "checks": [
          {"check_type": "grpc_health", "state": "HEALTH_STATE_SERVING"},
          {"check_type": "rpc_latency", "state": "HEALTH_STATE_SERVING", "value": 38.0, "threshold": 500.0},
          {"check_type": "rpc_error_ratio", "state": "HEALTH_STATE_SERVING", "value": 0.02, "threshold": 0.1},
          {"check_type": "persistence_latency", "state": "HEALTH_STATE_SERVING", "value": 15.0, "threshold": 500.0},
          {"check_type": "persistence_error_ratio", "state": "HEALTH_STATE_SERVING", "value": 0.0, "threshold": 0.1}
        ]
      },
      {
        "address": "10.0.1.8:7234",
        "state": "HEALTH_STATE_NOT_SERVING",
        "checks": [
          {"check_type": "host_availability", "state": "HEALTH_STATE_NOT_SERVING", "message": "failed to reach host for health check: rpc error: code = Unavailable desc = connection refused"}
        ]
      },
      {
        "address": "10.0.1.9:7234",
        "state": "HEALTH_STATE_NOT_SERVING",
        "checks": [
          {"check_type": "host_availability", "state": "HEALTH_STATE_NOT_SERVING", "message": "failed to reach host for health check: rpc error: code = Unavailable desc = connection refused"}
        ]
      },
      {
        "address": "10.0.1.10:7234",
        "state": "HEALTH_STATE_NOT_SERVING",
        "checks": [
          {"check_type": "host_availability", "state": "HEALTH_STATE_NOT_SERVING", "message": "failed to reach host for health check: context deadline exceeded"}
        ]
      }
    ]
  }]
}
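To make the aggregation arithmetic behind the last two examples concrete, here is a hedged sketch of the service-level decision. It is modeled on the failedHostCountProportion logic quoted in the review below; the minimum-failed-hosts guard and its value of 2 come from the description above and are assumptions, not confirmed implementation details.

package main

import "fmt"

// aggregateState is illustrative only: the failure proportion must exceed the
// configured threshold and at least minFailedHosts hosts must have failed
// before the service is reported NOT_SERVING.
func aggregateState(failedHosts, totalHosts int, failureThreshold float64, minFailedHosts int) string {
	proportion := float64(failedHosts) / float64(totalHosts)
	if proportion > failureThreshold && failedHosts >= minFailedHosts {
		return "HEALTH_STATE_NOT_SERVING"
	}
	return "HEALTH_STATE_SERVING"
}

func main() {
	// 1 of 3 hosts failing: 33% > 25%, but below the 2-host minimum -> SERVING.
	fmt.Println(aggregateState(1, 3, 0.25, 2))
	// 3 of 6 hosts failing: 50% > 25% and at least 2 failed hosts -> NOT_SERVING.
	fmt.Println(aggregateState(3, 6, 0.25, 2))
}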

Host voluntarily draining — gRPC health declined (DECLINED_SERVING)

When a host's gRPC health server reports not serving (e.g. during graceful shutdown), the check returns DECLINED_SERVING. If enough hosts are in this state (exceeding the declined serving proportion threshold), the overall service state becomes DECLINED_SERVING.

{
  "state": "HEALTH_STATE_DECLINED_SERVING",
  "services": [{
    "service": "history",
    "state": "HEALTH_STATE_DECLINED_SERVING",
    "hosts": [
      {
        "address": "10.0.1.5:7234",
        "state": "HEALTH_STATE_DECLINED_SERVING",
        "checks": [
          {"check_type": "grpc_health", "state": "HEALTH_STATE_DECLINED_SERVING", "message": "gRPC health server not serving"},
          {"check_type": "rpc_latency", "state": "HEALTH_STATE_SERVING", "value": 45.0, "threshold": 500.0},
          {"check_type": "rpc_error_ratio", "state": "HEALTH_STATE_SERVING", "value": 0.01, "threshold": 0.1},
          {"check_type": "persistence_latency", "state": "HEALTH_STATE_SERVING", "value": 12.0, "threshold": 500.0},
          {"check_type": "persistence_error_ratio", "state": "HEALTH_STATE_SERVING", "value": 0.0, "threshold": 0.1}
        ]
      },
      {
        "address": "10.0.1.6:7234",
        "state": "HEALTH_STATE_DECLINED_SERVING",
        "checks": [
          {"check_type": "grpc_health", "state": "HEALTH_STATE_DECLINED_SERVING", "message": "gRPC health server not serving"},
          {"check_type": "rpc_latency", "state": "HEALTH_STATE_SERVING", "value": 52.0, "threshold": 500.0},
          {"check_type": "rpc_error_ratio", "state": "HEALTH_STATE_SERVING", "value": 0.0, "threshold": 0.1},
          {"check_type": "persistence_latency", "state": "HEALTH_STATE_SERVING", "value": 18.0, "threshold": 500.0},
          {"check_type": "persistence_error_ratio", "state": "HEALTH_STATE_SERVING", "value": 0.0, "threshold": 0.1}
        ]
      },
      {
        "address": "10.0.1.7:7234",
        "state": "HEALTH_STATE_SERVING",
        "checks": [
          {"check_type": "grpc_health", "state": "HEALTH_STATE_SERVING"},
          {"check_type": "rpc_latency", "state": "HEALTH_STATE_SERVING", "value": 38.0, "threshold": 500.0},
          {"check_type": "rpc_error_ratio", "state": "HEALTH_STATE_SERVING", "value": 0.02, "threshold": 0.1},
          {"check_type": "persistence_latency", "state": "HEALTH_STATE_SERVING", "value": 15.0, "threshold": 500.0},
          {"check_type": "persistence_error_ratio", "state": "HEALTH_STATE_SERVING", "value": 0.0, "threshold": 0.1}
        ]
      }
    ]
  }]
}
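A minimal sketch of the mapping this example implies, assuming the handler consults the standard grpc_health_v1 serving status; the mapping function is illustrative and not the handler's actual code.

package main

import (
	"fmt"

	healthgrpc "google.golang.org/grpc/health/grpc_health_v1"
)

// mapGRPCHealth (illustrative) translates the gRPC health server's serving
// status into the states used by this API: a draining host that reports
// NOT_SERVING is surfaced as DECLINED_SERVING rather than a hard failure.
func mapGRPCHealth(status healthgrpc.HealthCheckResponse_ServingStatus) string {
	switch status {
	case healthgrpc.HealthCheckResponse_SERVING:
		return "HEALTH_STATE_SERVING"
	case healthgrpc.HealthCheckResponse_NOT_SERVING:
		return "HEALTH_STATE_DECLINED_SERVING"
	default:
		return "HEALTH_STATE_UNSPECIFIED"
	}
}

func main() {
	fmt.Println(mapGRPCHealth(healthgrpc.HealthCheckResponse_NOT_SERVING))
}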

No hosts in membership

When the membership resolver returns an empty host list, the response includes a service-level message but no hosts.

{
  "state": "HEALTH_STATE_NOT_SERVING",
  "services": [{
    "service": "history",
    "state": "HEALTH_STATE_NOT_SERVING",
    "message": "no available hosts in membership"
  }]
}

Membership resolver failure (INTERNAL_ERROR)

When the frontend can't resolve the membership ring at all (infrastructure failure), the response includes INTERNAL_ERROR with the resolver error.

{
  "state": "HEALTH_STATE_INTERNAL_ERROR",
  "services": [{
    "service": "history",
    "state": "HEALTH_STATE_INTERNAL_ERROR",
    "message": "failed to get membership resolver: membership monitor not started"
  }]
}
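The two service-level error paths above can be summarized in a short sketch; serviceHealthDetail is a stand-in for the generated message and buildServiceDetail is a hypothetical helper, shown only to contrast the INTERNAL_ERROR and empty-membership cases.

package main

import (
	"errors"
	"fmt"
)

// serviceHealthDetail is a plain stand-in for the generated ServiceHealthDetail.
type serviceHealthDetail struct {
	Service string
	State   string
	Message string
}

// buildServiceDetail (hypothetical) distinguishes the two error paths:
// resolver failures are infrastructure errors (INTERNAL_ERROR), while an
// empty membership ring is reported as NOT_SERVING with a message.
func buildServiceDetail(resolverErr error, hostCount int) serviceHealthDetail {
	d := serviceHealthDetail{Service: "history"}
	switch {
	case resolverErr != nil:
		d.State = "HEALTH_STATE_INTERNAL_ERROR"
		d.Message = fmt.Sprintf("failed to get membership resolver: %v", resolverErr)
	case hostCount == 0:
		d.State = "HEALTH_STATE_NOT_SERVING"
		d.Message = "no available hosts in membership"
	default:
		d.State = "HEALTH_STATE_SERVING"
	}
	return d
}

func main() {
	fmt.Println(buildServiceDetail(errors.New("membership monitor not started"), 0))
	fmt.Println(buildServiceDetail(nil, 0))
}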

Backward compatibility

  • DeepHealthCheckResponse.state (field 1) unchanged in both history and admin protos
  • New fields (checks, services) are additive (field 2) — old clients simply ignore them
  • GetState() continues to work as before (see the usage sketch below)
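A hedged usage sketch of the backward-compatible path, assuming the generated AdminServiceClient and enums packages keep their usual import paths; exact package names may differ.

package compat

import (
	"context"

	adminservice "go.temporal.io/server/api/adminservice/v1"
	enumsspb "go.temporal.io/server/api/enums/v1"
)

// isCellServing shows what an existing caller keeps doing after this PR: it
// reads only GetState() and never touches the new services field, which older
// generated code simply ignores.
func isCellServing(ctx context.Context, client adminservice.AdminServiceClient) (bool, error) {
	resp, err := client.DeepHealthCheck(ctx, &adminservice.DeepHealthCheckRequest{})
	if err != nil {
		return false, err
	}
	return resp.GetState() == enumsspb.HEALTH_STATE_SERVING, nil
}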

Related

Test plan

  • All existing TestHealthCheckerSuite tests pass (19 tests)
  • New tests: Test_Check_ServiceDetail_Populated, Test_Check_HostChecks_Propagated, Test_Check_GetResolver_Error (INTERNAL_ERROR + message), Test_Check_No_Available_Hosts (message)
  • Full go build ./... passes
  • Verify saas-control-plane can import go.temporal.io/server/common/health constants
  • Integration test with actual history service

🤖 Generated with Claude Code

Extend the DeepHealthCheck API to return detailed per-check diagnostic
information alongside the existing HealthState enum. This enables
upstream consumers (e.g. saas-control-plane fault detection) to
understand which specific health check failed and why, rather than
only knowing a cell is unhealthy.

Changes:
- New proto: health/v1/message.proto with HealthCheck, HostHealthDetail,
  and ServiceHealthDetail messages
- New enum value: HEALTH_STATE_INTERNAL_ERROR for resolver failures
- History handler runs all 5 checks (gRPC health, RPC latency/errors,
  persistence latency/errors) and returns per-check results with
  check_type, value, threshold, and human-readable message
- Frontend aggregator collects per-host details and propagates them
  through ServiceHealthDetail
- AdminService passes through ServiceHealthDetail to callers
- check_type uses string constants (not proto enum) for extensibility;
  constants live in common/health/ for import by consumers

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
laniehei requested review from a team as code owners on February 19, 2026 01:15
laniehei and others added 3 commits February 18, 2026 17:25
Previously, a gRPC health check error caused an immediate return,
skipping all 4 remaining checks. Now the error is captured as a
NOT_SERVING check result with the error message, and all checks
continue to run so the full diagnostic picture is always returned.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When the gRPC health check returns an error, return early with only
that check result rather than running the remaining metric checks.
The caller still gets the diagnostic info about what failed (check_type
+ message) instead of a bare error.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Rename healthpb -> healthspb per importas config
- Add default clause to switch statement (revive)
- Use s.Require().NoError() for error assertions (testifylint)
- Use s.InDelta() for float comparisons (testifylint)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Comment on lines 91 to +107
	if err != nil {
		h.logger.Warn("failed to ping deep health check", tag.Error(err), tag.ServerName(string(h.serviceName)))
		// Synthetic check: the host health check RPC failed, so we create a
		// HealthCheck entry to propagate the error message upstream. The State
		// here mirrors DeepHealthCheckResponse.State since there's only one check.
		resp = &historyservice.DeepHealthCheckResponse{
			State: enumsspb.HEALTH_STATE_NOT_SERVING,
			Checks: []*healthspb.HealthCheck{
				{
					CheckType: health.CheckTypeHostAvailability,
					State:     enumsspb.HEALTH_STATE_NOT_SERVING,
					Message:   fmt.Sprintf("failed to reach host for health check: %v", err),
				},
			},
		}
	}
	if resp == nil {

Might be a bug here. Is it guaranteed that healthcheckfn can't return err != nil and resp==nil? In that case, we will blank out err by overriding resp

healthState := <-receiveCh
switch healthState {
case enumsspb.HEALTH_STATE_NOT_SERVING, enumsspb.HEALTH_STATE_UNSPECIFIED:
result := <-receiveCh

Minor: there's a panic vulnerability here; if one of the goroutines we're waiting on panics, this will hang forever
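For illustration, one way to bound that wait (names are placeholders, not the PR's code): select on the result channel and the request context so a goroutine that never sends cannot block the collector forever.

package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// collectWithDeadline waits for one result per host but gives up when the
// context expires, so a goroutine that panicked (and will never send) does
// not hang the health check indefinitely. Purely illustrative.
func collectWithDeadline(ctx context.Context, receiveCh <-chan string, hosts int) ([]string, error) {
	results := make([]string, 0, hosts)
	for i := 0; i < hosts; i++ {
		select {
		case r := <-receiveCh:
			results = append(results, r)
		case <-ctx.Done():
			return results, errors.New("timed out waiting for host health results")
		}
	}
	return results, nil
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 100*time.Millisecond)
	defer cancel()
	ch := make(chan string, 1)
	ch <- "HEALTH_STATE_SERVING"
	fmt.Println(collectWithDeadline(ctx, ch, 2)) // second receive times out
}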

	} else {
		failedHostCountProportion := failedHostCount / float64(len(hosts))
		if failedHostCountProportion+hostDeclinedServingProportion > h.hostFailurePercentage() {
			h.logger.Warn("health check exceeded host failure percentage threshold", tag.Float64("host failure percentage threshold", h.hostFailurePercentage()), tag.Float64("host failure percentage", failedHostCountProportion), tag.Float64("host declined serving percentage", hostDeclinedServingProportion))

It would be nice to add an example failure from the failed hosts in this log (We can get it from ensureMinimumProportionOfHosts)

