From bd518d8ba2db8837565af7167c7ac8e7cbce820c Mon Sep 17 00:00:00 2001 From: Charlie Le Date: Sat, 7 Mar 2026 10:34:01 -0800 Subject: [PATCH 1/6] Add proposal for per-tenant TSDB status API Co-Authored-By: Claude Opus 4.6 Signed-off-by: Charlie Le --- docs/proposals/per-tenant-tsdb-status-api.md | 192 +++++++++++++++++++ 1 file changed, 192 insertions(+) create mode 100644 docs/proposals/per-tenant-tsdb-status-api.md diff --git a/docs/proposals/per-tenant-tsdb-status-api.md b/docs/proposals/per-tenant-tsdb-status-api.md new file mode 100644 index 00000000000..b7247228dd1 --- /dev/null +++ b/docs/proposals/per-tenant-tsdb-status-api.md @@ -0,0 +1,192 @@ +--- +title: "Per-Tenant TSDB Status API" +linkTitle: "Per-Tenant TSDB Status API" +weight: 1 +slug: per-tenant-tsdb-status-api +--- + +- Author: [Charlie Le](https://github.com/CharlieTLe) +- Date: March 2026 +- Status: Draft + +## Background + +High-cardinality series is one of the most common operational challenges for Prometheus-based systems. When a tenant has too many active series, it can lead to increased resource usage in ingesters, slower queries, and ultimately hitting per-tenant series limits. + +Currently, Cortex tenants lack visibility into which metrics, labels, and label-value pairs contribute the most series in ingesters. Without this information, debugging high-cardinality issues requires operators to inspect TSDB internals directly on ingester instances, which is impractical in a multi-tenant, distributed environment. + +Prometheus itself exposes a `/api/v1/status/tsdb` endpoint that provides cardinality statistics from the TSDB head. This proposal brings equivalent functionality to Cortex as a multi-tenant, distributed API. + +## Goal + +Expose per-tenant TSDB head cardinality statistics via a REST API endpoint on the Cortex query path. The endpoint should: + +1. Be compatible with the Prometheus `/api/v1/status/tsdb` response format. +2. 
Aggregate statistics across all ingesters that hold data for the requesting tenant. +3. Correctly account for replication factor when summing series counts and memory usage. +4. Respect multi-tenancy, ensuring tenants can only see their own data. + +## Out of Scope + +- **Long-term storage cardinality analysis**: This endpoint only covers in-memory TSDB head data in ingesters. Analyzing cardinality across compacted blocks in object storage is a separate concern. A future long-term cardinality API could reuse portable fields (see [Extensibility](#extensibility-to-long-term-storage)) or introduce a separate endpoint. +- **Automated cardinality limiting**: This is a read-only diagnostic endpoint; it does not enforce or suggest limits. +- **Cardinality reduction actions**: The endpoint reports statistics but does not provide mechanisms to drop or relabel series. + +## Proposed Design + +### Endpoint + +``` +GET /api/v1/status/tsdb?limit=N +``` + +- **Authentication**: Requires `X-Scope-OrgID` header (standard Cortex tenant authentication). +- **Query Parameter**: `limit` (optional, default 10) - controls the number of top items returned per category. +- **Legacy Path**: Also registered at `/api/v1/status/tsdb`. + +### Architecture + +The request flows through the Querier's HTTP handler, which delegates to the in-process Distributor for ingester fan-out: + +``` +Client → HTTP Handler (Querier) → In-process Distributor → gRPC Fan-out (Ingesters) → Aggregation (Distributor) → JSON Response +``` + +1. **HTTP Handler** (`TSDBStatusHandler` in `pkg/querier/tsdb_status_handler.go`): Registered via `NewQuerierHandler` in `pkg/api/handlers.go`. Parses the `limit` query parameter and calls the distributor's `TSDBStatus` method. +2. **Distributor Fan-out** (`TSDBStatus` in `pkg/distributor/distributor.go`): The Querier process holds an in-process Distributor instance (initialized via the `DistributorService` module). 
This instance uses `GetIngestersForMetadata` to discover all ingesters for the tenant, then sends a `TSDBStatusRequest` gRPC call to each ingester in the replication set. +3. **Ingester** (`TSDBStatus` in `pkg/ingester/ingester.go`): Retrieves the tenant's TSDB head and calls `db.Head().Stats(labels.MetricName, limit)` to get cardinality statistics from the Prometheus TSDB library. +4. **Aggregation**: The distributor merges responses from all ingesters and returns the combined result. + +### gRPC Definition + +A new `TSDBStatus` RPC is added to the Ingester service in `pkg/ingester/client/ingester.proto`: + +```protobuf +rpc TSDBStatus(TSDBStatusRequest) returns (TSDBStatusResponse) {}; + +message TSDBStatusRequest { + int32 limit = 1; +} + +message TSDBStatusResponse { + uint64 num_series = 1; + int64 min_time = 2; + int64 max_time = 3; + int32 num_label_pairs = 4; + repeated TSDBStatItem series_count_by_metric_name = 5; + repeated TSDBStatItem label_value_count_by_label_name = 6; + repeated TSDBStatItem memory_in_bytes_by_label_name = 7; + repeated TSDBStatItem series_count_by_label_value_pair = 8; +} + +message TSDBStatItem { + string name = 1; + uint64 value = 2; +} +``` + +### Aggregation Logic + +Because each series is replicated across multiple ingesters (controlled by the replication factor), the aggregation logic must account for this when merging responses: + +| Field | Aggregation Strategy | +|---|---| +| `numSeries` | Sum across ingesters, divide by replication factor | +| `minTime` | Minimum across all ingesters | +| `maxTime` | Maximum across all ingesters | +| `numLabelPairs` | Maximum across ingesters | +| `seriesCountByMetricName` | Sum per metric, divide by RF, return top N | +| `labelValueCountByLabelName` | Maximum per label (unique counts, not affected by replication) | +| `memoryInBytesByLabelName` | Sum per label, divide by RF, return top N | +| `seriesCountByLabelValuePair` | Sum per pair, divide by RF, return top N | + +The `topNStats` 
helper function handles the sort-and-truncate step: it divides values by the replication factor, sorts descending by value, and returns the top N items. + +### Response Format + +The JSON response uses a flat structure for head statistics: + +```json +{ + "numSeries": 1500, + "minTime": 1709740800000, + "maxTime": 1709748000000, + "numLabelPairs": 42, + "seriesCountByMetricName": [ + {"name": "http_requests_total", "value": 500}, + {"name": "process_cpu_seconds_total", "value": 200} + ], + "labelValueCountByLabelName": [ + {"name": "instance", "value": 50}, + {"name": "job", "value": 10} + ], + "memoryInBytesByLabelName": [ + {"name": "instance", "value": 25600}, + {"name": "job", "value": 5120} + ], + "seriesCountByLabelValuePair": [ + {"name": "job=api-server", "value": 300}, + {"name": "instance=host1:9090", "value": 150} + ] +} +``` + +### API Compatibility with Prometheus + +The response format intentionally diverges from the upstream Prometheus `/api/v1/status/tsdb` endpoint in two ways: + +1. **Flat structure vs nested `headStats`**: Prometheus wraps `numSeries`, `numLabelPairs`, `chunkCount`, `minTime`, and `maxTime` inside a `headStats` object. This proposal uses a flat structure at the top level instead, which is simpler for consumers but means existing Prometheus client libraries cannot parse the response directly. + +2. **`chunkCount` omitted**: Prometheus includes a `chunkCount` field (from `prometheus_tsdb_head_chunks`). In a distributed system with replication, chunk counts across ingesters cannot be meaningfully aggregated — chunks are an ingester-local storage detail, and summing/dividing by the replication factor does not produce a useful number. + +**Open question**: Should we adopt the `headStats` wrapper to maintain client compatibility with Prometheus tooling? 
The trade-off is compatibility vs simplicity — the flat format is easier to consume for Cortex-specific clients, but adopting the Prometheus format would allow reuse of existing client libraries. + +### Extensibility to Long-Term Storage + +Some fields in the response are inherently specific to the in-memory TSDB head and would not translate to a long-term storage cardinality API: + +| Field | Head-specific? | Notes | +|---|---|---| +| `seriesCountByMetricName` | No | Portable to block storage | +| `labelValueCountByLabelName` | No | Portable to block storage | +| `seriesCountByLabelValuePair` | No | Portable to block storage | +| `memoryInBytesByLabelName` | **Yes** | In-memory byte usage has no analogue in object storage | +| `minTime` / `maxTime` | **Yes** | Reflects head time range, not total storage | +| `numSeries` | Partially | Head-only count; block storage would have a different count | +| `numLabelPairs` | Partially | Head-only count | + +If a long-term storage cardinality API is added in the future, the portable fields (`seriesCountByMetricName`, `labelValueCountByLabelName`, `seriesCountByLabelValuePair`) could share a common response format. Head-specific fields like `memoryInBytesByLabelName` would remain scoped to this endpoint. This could be achieved by either adding a `source=head|blocks` query parameter to this endpoint or introducing a separate endpoint for block storage cardinality. + +### Multi-Tenancy + +Tenant isolation is enforced through the existing Cortex authentication middleware. The `X-Scope-OrgID` header identifies the tenant, and the ingester only returns statistics from that tenant's TSDB head. No cross-tenant data leakage is possible because each tenant has a separate TSDB instance in the ingester. + +## Design Alternatives + +### Distributor vs Querier Routing + +This design routes the endpoint through the **Querier**, which handles the HTTP request and delegates to the in-process Distributor for ingester fan-out and aggregation. 
An alternative is to route through the **Distributor** directly. + +**Current approach (Querier):** +- Provides logical separation — this is a read-only diagnostic endpoint and belongs on the read path alongside other query APIs. +- Follows the pattern used by the `/api/v1/metadata` endpoint, which is registered via `NewQuerierHandler` and delegates to the Distributor's `MetricsMetadata` method. +- Requires adding `TSDBStatus` to the Querier's Distributor interface (`pkg/querier/distributor_queryable.go`) and a handler in the Querier package. + +**Alternative (Distributor):** +- Follows the pattern used by the `UserStats` endpoint, which is registered directly on the Distributor. +- Slightly simpler — no need to thread the method through the Querier's Distributor interface. + +Note that both approaches have the same number of network hops. Even in microservices mode, the Querier process initializes an in-process Distributor instance via the `DistributorService` module (a `UserInvisibleModule` dependency of `Queryable`). This in-process Distributor holds its own ingester client pool and connects directly to ingesters via gRPC. The choice between Querier and Distributor routing only affects which process serves the HTTP request, not the number of network hops. 
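For concreteness, the head-path aggregation described in the table above can be sketched in Go. This is an illustrative, self-contained version: the names `TSDBStatItem` and `topNStats` follow the proposal, but the actual signatures in `pkg/distributor/distributor.go` may differ.

```go
package main

import (
	"fmt"
	"sort"
)

// TSDBStatItem mirrors the proto message: a name/value pair.
type TSDBStatItem struct {
	Name  string
	Value uint64
}

// topNStats is a hypothetical sketch of the helper described above:
// it divides each summed value by the replication factor, sorts
// descending by value, and returns the top n items.
func topNStats(summed map[string]uint64, replicationFactor uint64, n int) []TSDBStatItem {
	items := make([]TSDBStatItem, 0, len(summed))
	for name, v := range summed {
		items = append(items, TSDBStatItem{Name: name, Value: v / replicationFactor})
	}
	sort.Slice(items, func(i, j int) bool { return items[i].Value > items[j].Value })
	if len(items) > n {
		items = items[:n]
	}
	return items
}

func main() {
	// Per-metric series counts summed across ingesters, with RF=3.
	summed := map[string]uint64{
		"http_requests_total":       1500,
		"process_cpu_seconds_total": 600,
		"up":                        9,
	}
	top := topNStats(summed, 3, 2)
	fmt.Println(top[0].Name, top[0].Value) // http_requests_total 500
	fmt.Println(top[1].Name, top[1].Value) // process_cpu_seconds_total 200
}
```

Note that integer division by the replication factor slightly undercounts when series are not fully replicated (e.g. during ingester restarts), which is acceptable for a diagnostic endpoint.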
+ +## Implementation + +The implementation spans the following key files: + +- `pkg/api/handlers.go` - Route registration in `NewQuerierHandler` +- `pkg/querier/tsdb_status_handler.go` - HTTP handler (`TSDBStatusHandler`) +- `pkg/querier/distributor_queryable.go` - `TSDBStatus` added to the Distributor interface +- `pkg/distributor/distributor.go` - Fan-out to ingesters and aggregation logic (`TSDBStatus`, `topNStats`) +- `pkg/ingester/ingester.go` - Per-tenant TSDB head stats retrieval (`TSDBStatus`, `statsToPB`) +- `pkg/ingester/client/ingester.proto` - gRPC message definitions (`TSDBStatusRequest`, `TSDBStatusResponse`, `TSDBStatItem`) +- `docs/api/_index.md` - API documentation +- `integration/api_endpoints_test.go` - Integration tests From d8388cc8b18d132f270d71ac7a932dfafa4f84ae Mon Sep 17 00:00:00 2001 From: Charlie Le Date: Fri, 13 Mar 2026 12:38:45 -0700 Subject: [PATCH 2/6] Extend TSDB status proposal with long-term storage cardinality via store gateways Add source=blocks query parameter to analyze cardinality from compacted blocks in object storage. The blocks path fans out to store gateways, which compute statistics from block index headers (cheap label value counts) and posting list expansion (exact series counts per metric). Results are cached per immutable block. Co-Authored-By: Claude Opus 4.6 (1M context) Signed-off-by: Charlie Le --- docs/proposals/per-tenant-tsdb-status-api.md | 241 +++++++++++++++++-- 1 file changed, 217 insertions(+), 24 deletions(-) diff --git a/docs/proposals/per-tenant-tsdb-status-api.md b/docs/proposals/per-tenant-tsdb-status-api.md index b7247228dd1..a9f88fc0732 100644 --- a/docs/proposals/per-tenant-tsdb-status-api.md +++ b/docs/proposals/per-tenant-tsdb-status-api.md @@ -19,16 +19,16 @@ Prometheus itself exposes a `/api/v1/status/tsdb` endpoint that provides cardina ## Goal -Expose per-tenant TSDB head cardinality statistics via a REST API endpoint on the Cortex query path. 
The endpoint should: +Expose per-tenant cardinality statistics via a REST API endpoint on the Cortex query path. The endpoint should: 1. Be compatible with the Prometheus `/api/v1/status/tsdb` response format. -2. Aggregate statistics across all ingesters that hold data for the requesting tenant. -3. Correctly account for replication factor when summing series counts and memory usage. -4. Respect multi-tenancy, ensuring tenants can only see their own data. +2. Support two data sources: in-memory TSDB head data from ingesters and compacted blocks from long-term object storage via store gateways. +3. Aggregate statistics across all ingesters or store gateways that hold data for the requesting tenant. +4. Correctly account for replication factor when summing series counts and memory usage. +5. Respect multi-tenancy, ensuring tenants can only see their own data. ## Out of Scope -- **Long-term storage cardinality analysis**: This endpoint only covers in-memory TSDB head data in ingesters. Analyzing cardinality across compacted blocks in object storage is a separate concern. A future long-term cardinality API could reuse portable fields (see [Extensibility](#extensibility-to-long-term-storage)) or introduce a separate endpoint. - **Automated cardinality limiting**: This is a read-only diagnostic endpoint; it does not enforce or suggest limits. - **Cardinality reduction actions**: The endpoint reports statistics but does not provide mechanisms to drop or relabel series. @@ -37,15 +37,21 @@ Expose per-tenant TSDB head cardinality statistics via a REST API endpoint on th ### Endpoint ``` -GET /api/v1/status/tsdb?limit=N +GET /api/v1/status/tsdb?limit=N&source=head|blocks ``` - **Authentication**: Requires `X-Scope-OrgID` header (standard Cortex tenant authentication). -- **Query Parameter**: `limit` (optional, default 10) - controls the number of top items returned per category. 
+- **Query Parameters**: + - `limit` (optional, default 10) - controls the number of top items returned per category. + - `source` (optional, default `head`) - selects the data source. `head` queries ingester TSDB heads, `blocks` queries compacted blocks in long-term storage via store gateways. - **Legacy Path**: Also registered at `/api/v1/status/tsdb`. ### Architecture +The HTTP handler parses the `source` parameter and delegates to the appropriate backend. + +#### Head Path (`source=head`) + The request flows through the Querier's HTTP handler, which delegates to the in-process Distributor for ingester fan-out: ``` @@ -57,8 +63,23 @@ Client → HTTP Handler (Querier) → In-process Distributor → gRPC Fan-out (I 3. **Ingester** (`TSDBStatus` in `pkg/ingester/ingester.go`): Retrieves the tenant's TSDB head and calls `db.Head().Stats(labels.MetricName, limit)` to get cardinality statistics from the Prometheus TSDB library. 4. **Aggregation**: The distributor merges responses from all ingesters and returns the combined result. +#### Blocks Path (`source=blocks`) + +The request flows through the Querier's HTTP handler, which fans out to store gateways: + +``` +Client → HTTP Handler (Querier) → gRPC Fan-out (Store Gateways) → Per-Tenant Block Index Analysis → Aggregation (Querier) → JSON Response +``` + +1. **HTTP Handler** (`TSDBStatusHandler` in `pkg/querier/tsdb_status_handler.go`): Parses `limit` and `source=blocks`, then calls the blocks store's `TSDBStatus` method. +2. **Store Gateway Fan-out**: The Querier uses its existing store gateway client pool (`BlocksStoreSet`) to discover store gateways that hold blocks for the tenant, then sends a `TSDBStatus` gRPC call to each relevant store gateway instance. +3. 
**Store Gateway** (`TSDBStatus` in `pkg/storegateway/gateway.go`): Locates the tenant's `BucketStore`, iterates over the tenant's loaded blocks, and computes cardinality statistics from block indexes (see [Block Index Cardinality Computation](#block-index-cardinality-computation)). +4. **Aggregation**: The querier merges responses from all store gateways and returns the combined result. + ### gRPC Definition +#### Ingester Service + A new `TSDBStatus` RPC is added to the Ingester service in `pkg/ingester/client/ingester.proto`: ```protobuf @@ -85,10 +106,40 @@ message TSDBStatItem { } ``` +#### Store Gateway Service + +A new `TSDBStatus` RPC is added to the StoreGateway service in `pkg/storegateway/storegatewaypb/gateway.proto`: + +```protobuf +rpc TSDBStatus(TSDBStatusRequest) returns (TSDBStatusResponse) {}; + +message TSDBStatusRequest { + int32 limit = 1; +} + +message TSDBStatusResponse { + uint64 num_series = 1; + int64 min_time = 2; + int64 max_time = 3; + repeated TSDBStatItem series_count_by_metric_name = 4; + repeated TSDBStatItem label_value_count_by_label_name = 5; + repeated TSDBStatItem series_count_by_label_value_pair = 6; +} + +message TSDBStatItem { + string name = 1; + uint64 value = 2; +} +``` + +The store gateway response omits `numLabelPairs` and `memoryInBytesByLabelName` because these fields are specific to the in-memory TSDB head (see [Response Format](#response-format) for details). 
+ ### Aggregation Logic Because each series is replicated across multiple ingesters (controlled by the replication factor), the aggregation logic must account for this when merging responses: +#### Head Path Aggregation + | Field | Aggregation Strategy | |---|---| | `numSeries` | Sum across ingesters, divide by replication factor | @@ -102,9 +153,77 @@ Because each series is replicated across multiple ingesters (controlled by the r The `topNStats` helper function handles the sort-and-truncate step: it divides values by the replication factor, sorts descending by value, and returns the top N items. +#### Blocks Path Aggregation + +Store gateways use the store gateway ring for replication, so different store gateways may serve the same blocks. The aggregation handles this differently from ingesters: + +| Field | Aggregation Strategy | +|---|---| +| `numSeries` | Sum across store gateways, divide by store gateway replication factor | +| `minTime` | Minimum across all store gateways | +| `maxTime` | Maximum across all store gateways | +| `seriesCountByMetricName` | Sum per metric, divide by SG RF, return top N | +| `labelValueCountByLabelName` | Maximum per label | +| `seriesCountByLabelValuePair` | Sum per pair, divide by SG RF, return top N | + +**Note on block overlap**: Before compaction completes, a tenant may have multiple blocks covering the same time range. Series that appear in overlapping blocks within a single store gateway are counted once per block they appear in, so the `numSeries` total may overcount compared to the true unique series count. This is an acceptable approximation — the primary use case is identifying which metrics and label-value pairs contribute the most cardinality, not producing an exact total. + +### Block Index Cardinality Computation + +The store gateway computes cardinality statistics from the block indexes already loaded for the tenant's `BucketStore`. 
Each block has an `indexheader.Reader` (memory-mapped binary index header) that provides cheap access to label metadata, and optionally a full `index.Reader` for posting list expansion. + +The three cardinality dimensions have different cost profiles: + +#### 1. Label Value Count by Label Name (Cheap) + +This is computed entirely from the index header, with no object storage I/O: + +```go +labelNames, _ := indexHeaderReader.LabelNames() +for _, name := range labelNames { + values, _ := indexHeaderReader.LabelValues(name) + // len(values) = number of distinct values for this label +} +``` + +The index header stores label name → label value → posting offset mappings in memory. Calling `LabelValues()` returns the distinct values directly. Across multiple blocks, the values are merged (set union) to produce the total distinct count per label. + +#### 2. Series Count by Metric Name (Moderate) + +To count the number of series per `__name__` value, we must determine the size of each posting list. Two approaches: + +**Option A — Posting list expansion**: For each metric name, call `ExpandedPostings(ctx, "__name__", metricName)` on the full block index to get the posting list (series IDs). The list length equals the series count. This requires fetching posting list data from object storage. + +**Option B — Posting offset estimation**: The index header stores the byte offset of each posting list in the index file. The byte length between consecutive posting offsets provides an estimate of the posting list size. Since posting lists are varint-encoded series IDs, the relationship between byte size and series count is approximately proportional. This avoids object storage I/O entirely but produces estimates rather than exact counts. + +**Recommendation**: Use Option A (posting list expansion) for the `__name__` label only. The number of distinct metric names is typically bounded (hundreds to low thousands), making the cost manageable. 
Results should be cached per block since compacted blocks are immutable (see [Caching](#caching)). + +#### 3. Series Count by Label-Value Pair (Expensive) + +This requires expanding posting lists for every label=value combination, which is an order of magnitude more expensive than metric-name-only expansion. For a tenant with 100 label names and 1,000 values each, this means 100,000 posting list lookups. + +**Recommendation**: This field is computed on-demand using Option A (posting list expansion). To bound the cost: +- Only expand posting lists for the top N label names by value count (already known from step 1). +- Within each label name, only expand posting lists for a bounded number of values. +- Apply a per-request timeout so that very high-cardinality tenants get partial results rather than unbounded computation. + +Results are cached per block (see [Caching](#caching)). + +#### Block Selection + +By default, the store gateway computes cardinality across all blocks it holds for the tenant. This represents the full long-term storage cardinality view. A future enhancement could add `min_time` / `max_time` query parameters to restrict the analysis to a specific time range. + +#### Caching + +Compacted blocks are immutable — once a block is written to object storage, its contents never change. This means cardinality statistics computed from a block's index can be cached indefinitely (until the block is deleted by the compactor). Each store gateway maintains a per-block cardinality cache keyed by `(block ULID, limit)`. This cache eliminates redundant index traversals when the endpoint is called repeatedly. + +The cache is populated on first request and invalidated when blocks are removed during compaction syncs. + ### Response Format -The JSON response uses a flat structure for head statistics: +The JSON response uses a flat structure. The fields returned depend on the `source` parameter. 
+ +#### Head Response (`source=head`) ```json { @@ -131,6 +250,35 @@ The JSON response uses a flat structure for head statistics: } ``` +#### Blocks Response (`source=blocks`) + +```json +{ + "numSeries": 125000, + "minTime": 1704067200000, + "maxTime": 1709740800000, + "seriesCountByMetricName": [ + {"name": "http_requests_total", "value": 45000}, + {"name": "process_cpu_seconds_total", "value": 18000} + ], + "labelValueCountByLabelName": [ + {"name": "instance", "value": 2500}, + {"name": "job", "value": 85} + ], + "seriesCountByLabelValuePair": [ + {"name": "job=api-server", "value": 22000}, + {"name": "instance=host1:9090", "value": 8500} + ] +} +``` + +The blocks response omits two head-specific fields: + +| Field | Why omitted from blocks | +|---|---| +| `numLabelPairs` | This count comes from `MemPostings` which tracks label pairs in memory. Block indexes do not maintain an equivalent aggregate count. | +| `memoryInBytesByLabelName` | This measures in-memory byte usage of label data in the ingester's TSDB head. It has no meaningful analogue in object storage — block indexes are memory-mapped and the on-disk size of label data depends on index encoding, not runtime memory. | + ### API Compatibility with Prometheus The response format intentionally diverges from the upstream Prometheus `/api/v1/status/tsdb` endpoint in two ways: @@ -141,21 +289,19 @@ The response format intentionally diverges from the upstream Prometheus `/api/v1 **Open question**: Should we adopt the `headStats` wrapper to maintain client compatibility with Prometheus tooling? The trade-off is compatibility vs simplicity — the flat format is easier to consume for Cortex-specific clients, but adopting the Prometheus format would allow reuse of existing client libraries. 
-### Extensibility to Long-Term Storage - -Some fields in the response are inherently specific to the in-memory TSDB head and would not translate to a long-term storage cardinality API: +### Field Portability Between Sources -| Field | Head-specific? | Notes | -|---|---|---| -| `seriesCountByMetricName` | No | Portable to block storage | -| `labelValueCountByLabelName` | No | Portable to block storage | -| `seriesCountByLabelValuePair` | No | Portable to block storage | -| `memoryInBytesByLabelName` | **Yes** | In-memory byte usage has no analogue in object storage | -| `minTime` / `maxTime` | **Yes** | Reflects head time range, not total storage | -| `numSeries` | Partially | Head-only count; block storage would have a different count | -| `numLabelPairs` | Partially | Head-only count | +Some fields are shared across both sources, while others are source-specific: -If a long-term storage cardinality API is added in the future, the portable fields (`seriesCountByMetricName`, `labelValueCountByLabelName`, `seriesCountByLabelValuePair`) could share a common response format. Head-specific fields like `memoryInBytesByLabelName` would remain scoped to this endpoint. This could be achieved by either adding a `source=head|blocks` query parameter to this endpoint or introducing a separate endpoint for block storage cardinality. 
+| Field | `source=head` | `source=blocks` | Notes | +|---|---|---|---| +| `seriesCountByMetricName` | Yes | Yes | Core cardinality diagnostic | +| `labelValueCountByLabelName` | Yes | Yes | Core cardinality diagnostic | +| `seriesCountByLabelValuePair` | Yes | Yes | Core cardinality diagnostic | +| `numSeries` | Yes | Yes | Approximate for blocks due to overlap | +| `minTime` / `maxTime` | Yes | Yes | Head time range vs block time range | +| `memoryInBytesByLabelName` | Yes | No | In-memory byte usage, head-specific | +| `numLabelPairs` | Yes | No | `MemPostings`-specific count | ### Multi-Tenancy @@ -163,7 +309,7 @@ Tenant isolation is enforced through the existing Cortex authentication middlewa ## Design Alternatives -### Distributor vs Querier Routing +### Distributor vs Querier Routing (Head Path) This design routes the endpoint through the **Querier**, which handles the HTTP request and delegates to the in-process Distributor for ingester fan-out and aggregation. An alternative is to route through the **Distributor** directly. @@ -178,15 +324,62 @@ This design routes the endpoint through the **Querier**, which handles the HTTP Note that both approaches have the same number of network hops. Even in microservices mode, the Querier process initializes an in-process Distributor instance via the `DistributorService` module (a `UserInvisibleModule` dependency of `Queryable`). This in-process Distributor holds its own ingester client pool and connects directly to ingesters via gRPC. The choice between Querier and Distributor routing only affects which process serves the HTTP request, not the number of network hops. +### Posting List Expansion vs Offset Estimation (Blocks Path) + +For computing series counts from block indexes, two strategies were considered: + +**Posting list expansion (chosen):** +- Fetches and decodes the posting list for each label value from the full block index. +- Produces exact series counts. 
+- Requires object storage I/O on first access (cached thereafter). +- Cost is proportional to the number of distinct metric names (typically manageable). + +**Posting offset estimation:** +- Uses byte offsets between consecutive entries in the posting offset table (available in the index header) to estimate posting list sizes. +- No object storage I/O required — uses only the memory-mapped index header. +- Produces approximate counts since varint encoding means byte size is not directly proportional to series count. +- Would require calibration or a conversion factor that varies by data characteristics. + +Posting list expansion was chosen because exact counts are more useful for cardinality debugging, and the per-block caching strategy (blocks are immutable) amortizes the object storage I/O cost over repeated requests. + +### Compactor-Based Precomputation (Blocks Path) + +An alternative to on-demand computation is precomputing cardinality statistics during compaction and storing them in block `meta.json`: + +**Pros:** +- Zero read-time cost — statistics are available immediately from block metadata. +- The compactor already reads the full block index during compaction and validation (`GatherIndexHealthStats`). + +**Cons:** +- Statistics are only available after compaction runs. Freshly uploaded blocks from ingesters would have no cardinality data until the next compaction cycle. +- Increases `meta.json` size. With hundreds of metric names and label-value pairs, the cardinality data could be significant. +- The `limit` parameter cannot be applied at precomputation time — either store all data or pick a fixed limit. +- Adds complexity to the compaction pipeline for a diagnostic feature. + +The on-demand approach was chosen because it works for all blocks immediately (not just compacted ones) and allows flexible `limit` parameters per request. 
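The per-block cache described under Caching could be sketched as follows. This is a minimal illustration under stated assumptions: `blockStats` is a placeholder for the real per-block result, and the real implementation would wire `drop` into the `BucketStore` block sync rather than call it directly.

```go
package main

import (
	"fmt"
	"sync"
)

// cacheKey matches the caching strategy described above: statistics are
// keyed by the immutable block's ULID plus the requested limit.
type cacheKey struct {
	blockULID string
	limit     int
}

// blockStats is a placeholder for the per-block cardinality result.
type blockStats struct {
	numSeries uint64
}

// cardinalityCache is a hypothetical per-store-gateway cache. Entries stay
// valid until the block is deleted, since compacted blocks never change.
type cardinalityCache struct {
	mu      sync.Mutex
	entries map[cacheKey]blockStats
}

// getOrCompute returns the cached statistics for (ulid, limit), computing
// and storing them on first access.
func (c *cardinalityCache) getOrCompute(ulid string, limit int, compute func() blockStats) blockStats {
	c.mu.Lock()
	defer c.mu.Unlock()
	k := cacheKey{blockULID: ulid, limit: limit}
	if s, ok := c.entries[k]; ok {
		return s // cache hit: no index traversal needed
	}
	s := compute()
	c.entries[k] = s
	return s
}

// drop removes entries for a block deleted during a compaction sync.
func (c *cardinalityCache) drop(ulid string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	for k := range c.entries {
		if k.blockULID == ulid {
			delete(c.entries, k)
		}
	}
}

func main() {
	cache := &cardinalityCache{entries: map[cacheKey]blockStats{}}
	computes := 0
	expensive := func() blockStats { computes++; return blockStats{numSeries: 1000} }
	cache.getOrCompute("BLOCK1", 10, expensive)
	cache.getOrCompute("BLOCK1", 10, expensive) // served from cache
	fmt.Println(computes)                       // 1
}
```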
+ ## Implementation The implementation spans the following key files: +### Head Path (Ingester) + - `pkg/api/handlers.go` - Route registration in `NewQuerierHandler` - `pkg/querier/tsdb_status_handler.go` - HTTP handler (`TSDBStatusHandler`) - `pkg/querier/distributor_queryable.go` - `TSDBStatus` added to the Distributor interface - `pkg/distributor/distributor.go` - Fan-out to ingesters and aggregation logic (`TSDBStatus`, `topNStats`) - `pkg/ingester/ingester.go` - Per-tenant TSDB head stats retrieval (`TSDBStatus`, `statsToPB`) - `pkg/ingester/client/ingester.proto` - gRPC message definitions (`TSDBStatusRequest`, `TSDBStatusResponse`, `TSDBStatItem`) -- `docs/api/_index.md` - API documentation -- `integration/api_endpoints_test.go` - Integration tests + +### Blocks Path (Store Gateway) + +- `pkg/querier/tsdb_status_handler.go` - HTTP handler routes `source=blocks` to store gateway path +- `pkg/querier/blocks_store_queryable.go` - `TSDBStatus` added to the store gateway query interface +- `pkg/storegateway/storegatewaypb/gateway.proto` - gRPC message definitions for store gateway `TSDBStatus` RPC +- `pkg/storegateway/gateway.go` - Store gateway `TSDBStatus` handler, delegates to `ThanosBucketStores` +- `pkg/storegateway/bucket_stores.go` - Per-tenant block iteration and cardinality computation from index headers and block indexes + +### Shared + +- `docs/api/_index.md` - API documentation (updated with `source` parameter) +- `integration/api_endpoints_test.go` - Integration tests for both head and blocks paths From e9782c4b8b0bb50347e932b92110bac41467784e Mon Sep 17 00:00:00 2001 From: Charlie Le Date: Fri, 13 Mar 2026 12:50:27 -0700 Subject: [PATCH 3/6] Update proposal based on PR review: rename to Cardinality API and simplify Address feedback from PR #7335 review: - Rename endpoint from /api/v1/status/tsdb to /api/v1/cardinality - Drop Prometheus compatibility as a goal - Add start/end time range query parameters - Drop head-specific fields (numLabelPairs, 
memoryInBytesByLabelName, minTime, maxTime) to unify response across both sources - Remove API Compatibility and Field Portability sections Co-Authored-By: Claude Opus 4.6 (1M context) Signed-off-by: Charlie Le --- docs/proposals/per-tenant-tsdb-status-api.md | 120 ++++--------------- 1 file changed, 25 insertions(+), 95 deletions(-) diff --git a/docs/proposals/per-tenant-tsdb-status-api.md b/docs/proposals/per-tenant-tsdb-status-api.md index a9f88fc0732..a0036a49636 100644 --- a/docs/proposals/per-tenant-tsdb-status-api.md +++ b/docs/proposals/per-tenant-tsdb-status-api.md @@ -1,8 +1,8 @@ --- -title: "Per-Tenant TSDB Status API" -linkTitle: "Per-Tenant TSDB Status API" +title: "Per-Tenant Cardinality API" +linkTitle: "Per-Tenant Cardinality API" weight: 1 -slug: per-tenant-tsdb-status-api +slug: per-tenant-cardinality-api --- - Author: [Charlie Le](https://github.com/CharlieTLe) @@ -15,17 +15,16 @@ High-cardinality series is one of the most common operational challenges for Pro Currently, Cortex tenants lack visibility into which metrics, labels, and label-value pairs contribute the most series in ingesters. Without this information, debugging high-cardinality issues requires operators to inspect TSDB internals directly on ingester instances, which is impractical in a multi-tenant, distributed environment. -Prometheus itself exposes a `/api/v1/status/tsdb` endpoint that provides cardinality statistics from the TSDB head. This proposal brings equivalent functionality to Cortex as a multi-tenant, distributed API. +This proposal introduces a dedicated cardinality API for Cortex that works across both in-memory ingester data and long-term block storage. ## Goal Expose per-tenant cardinality statistics via a REST API endpoint on the Cortex query path. The endpoint should: -1. Be compatible with the Prometheus `/api/v1/status/tsdb` response format. -2. 
Support two data sources: in-memory TSDB head data from ingesters and compacted blocks from long-term object storage via store gateways. -3. Aggregate statistics across all ingesters or store gateways that hold data for the requesting tenant. -4. Correctly account for replication factor when summing series counts and memory usage. -5. Respect multi-tenancy, ensuring tenants can only see their own data. +1. Support two data sources: in-memory TSDB head data from ingesters and compacted blocks from long-term object storage via store gateways. +2. Aggregate statistics across all ingesters or store gateways that hold data for the requesting tenant. +3. Correctly account for replication factor when summing series counts. +4. Respect multi-tenancy, ensuring tenants can only see their own data. ## Out of Scope @@ -37,14 +36,19 @@ Expose per-tenant cardinality statistics via a REST API endpoint on the Cortex q ### Endpoint ``` -GET /api/v1/status/tsdb?limit=N&source=head|blocks +GET /api/v1/cardinality?limit=N&source=head|blocks&start=T&end=T ``` - **Authentication**: Requires `X-Scope-OrgID` header (standard Cortex tenant authentication). - **Query Parameters**: - `limit` (optional, default 10) - controls the number of top items returned per category. - `source` (optional, default `head`) - selects the data source. `head` queries ingester TSDB heads, `blocks` queries compacted blocks in long-term storage via store gateways. -- **Legacy Path**: Also registered at `/api/v1/status/tsdb`. + - `start` (optional, RFC3339 or Unix timestamp) - start of the time range to analyze. + - `end` (optional, RFC3339 or Unix timestamp) - end of the time range to analyze. +- **Time Range Behavior**: + - **Blocks path**: Only blocks whose time range overlaps with `[start, end]` are analyzed. A block is included if its `minTime < end` and its `maxTime > start`. + - **Head path**: Head stats are included if the head's time range overlaps with `[start, end]`. 
The TSDB head does not support sub-range cardinality filtering, so when included, stats reflect the full head. + - When `start` and `end` are omitted, all available data is included. ### Architecture @@ -91,13 +95,9 @@ message TSDBStatusRequest { message TSDBStatusResponse { uint64 num_series = 1; - int64 min_time = 2; - int64 max_time = 3; - int32 num_label_pairs = 4; - repeated TSDBStatItem series_count_by_metric_name = 5; - repeated TSDBStatItem label_value_count_by_label_name = 6; - repeated TSDBStatItem memory_in_bytes_by_label_name = 7; - repeated TSDBStatItem series_count_by_label_value_pair = 8; + repeated TSDBStatItem series_count_by_metric_name = 2; + repeated TSDBStatItem label_value_count_by_label_name = 3; + repeated TSDBStatItem series_count_by_label_value_pair = 4; } message TSDBStatItem { @@ -119,11 +119,9 @@ message TSDBStatusRequest { message TSDBStatusResponse { uint64 num_series = 1; - int64 min_time = 2; - int64 max_time = 3; - repeated TSDBStatItem series_count_by_metric_name = 4; - repeated TSDBStatItem label_value_count_by_label_name = 5; - repeated TSDBStatItem series_count_by_label_value_pair = 6; + repeated TSDBStatItem series_count_by_metric_name = 2; + repeated TSDBStatItem label_value_count_by_label_name = 3; + repeated TSDBStatItem series_count_by_label_value_pair = 4; } message TSDBStatItem { @@ -132,7 +130,7 @@ message TSDBStatItem { } ``` -The store gateway response omits `numLabelPairs` and `memoryInBytesByLabelName` because these fields are specific to the in-memory TSDB head (see [Response Format](#response-format) for details). +Both the ingester and store gateway response messages share the same fields. 
### Aggregation Logic @@ -143,12 +141,8 @@ Because each series is replicated across multiple ingesters (controlled by the r | Field | Aggregation Strategy | |---|---| | `numSeries` | Sum across ingesters, divide by replication factor | -| `minTime` | Minimum across all ingesters | -| `maxTime` | Maximum across all ingesters | -| `numLabelPairs` | Maximum across ingesters | | `seriesCountByMetricName` | Sum per metric, divide by RF, return top N | | `labelValueCountByLabelName` | Maximum per label (unique counts, not affected by replication) | -| `memoryInBytesByLabelName` | Sum per label, divide by RF, return top N | | `seriesCountByLabelValuePair` | Sum per pair, divide by RF, return top N | The `topNStats` helper function handles the sort-and-truncate step: it divides values by the replication factor, sorts descending by value, and returns the top N items. @@ -160,8 +154,6 @@ Store gateways use the store gateway ring for replication, so different store ga | Field | Aggregation Strategy | |---|---| | `numSeries` | Sum across store gateways, divide by store gateway replication factor | -| `minTime` | Minimum across all store gateways | -| `maxTime` | Maximum across all store gateways | | `seriesCountByMetricName` | Sum per metric, divide by SG RF, return top N | | `labelValueCountByLabelName` | Maximum per label | | `seriesCountByLabelValuePair` | Sum per pair, divide by SG RF, return top N | @@ -211,7 +203,7 @@ Results are cached per block (see [Caching](#caching)). #### Block Selection -By default, the store gateway computes cardinality across all blocks it holds for the tenant. This represents the full long-term storage cardinality view. A future enhancement could add `min_time` / `max_time` query parameters to restrict the analysis to a specific time range. +When the `start` and `end` query parameters are provided, the store gateway filters blocks based on time range overlap: a block is included only if its `minTime < end` and its `maxTime > start`. 
This allows users to scope cardinality analysis to a specific time window, such as the last 24 hours or a particular incident period. When `start` and `end` are omitted, all blocks for the tenant are included. #### Caching @@ -221,16 +213,11 @@ The cache is populated on first request and invalidated when blocks are removed ### Response Format -The JSON response uses a flat structure. The fields returned depend on the `source` parameter. - -#### Head Response (`source=head`) +The JSON response uses a flat structure. Both `source=head` and `source=blocks` return the same fields: ```json { "numSeries": 1500, - "minTime": 1709740800000, - "maxTime": 1709748000000, - "numLabelPairs": 42, "seriesCountByMetricName": [ {"name": "http_requests_total", "value": 500}, {"name": "process_cpu_seconds_total", "value": 200} @@ -239,10 +226,6 @@ The JSON response uses a flat structure. The fields returned depend on the `sour {"name": "instance", "value": 50}, {"name": "job", "value": 10} ], - "memoryInBytesByLabelName": [ - {"name": "instance", "value": 25600}, - {"name": "job", "value": 5120} - ], "seriesCountByLabelValuePair": [ {"name": "job=api-server", "value": 300}, {"name": "instance=host1:9090", "value": 150} @@ -250,59 +233,6 @@ The JSON response uses a flat structure. 
The fields returned depend on the `sour } ``` -#### Blocks Response (`source=blocks`) - -```json -{ - "numSeries": 125000, - "minTime": 1704067200000, - "maxTime": 1709740800000, - "seriesCountByMetricName": [ - {"name": "http_requests_total", "value": 45000}, - {"name": "process_cpu_seconds_total", "value": 18000} - ], - "labelValueCountByLabelName": [ - {"name": "instance", "value": 2500}, - {"name": "job", "value": 85} - ], - "seriesCountByLabelValuePair": [ - {"name": "job=api-server", "value": 22000}, - {"name": "instance=host1:9090", "value": 8500} - ] -} -``` - -The blocks response omits two head-specific fields: - -| Field | Why omitted from blocks | -|---|---| -| `numLabelPairs` | This count comes from `MemPostings` which tracks label pairs in memory. Block indexes do not maintain an equivalent aggregate count. | -| `memoryInBytesByLabelName` | This measures in-memory byte usage of label data in the ingester's TSDB head. It has no meaningful analogue in object storage — block indexes are memory-mapped and the on-disk size of label data depends on index encoding, not runtime memory. | - -### API Compatibility with Prometheus - -The response format intentionally diverges from the upstream Prometheus `/api/v1/status/tsdb` endpoint in two ways: - -1. **Flat structure vs nested `headStats`**: Prometheus wraps `numSeries`, `numLabelPairs`, `chunkCount`, `minTime`, and `maxTime` inside a `headStats` object. This proposal uses a flat structure at the top level instead, which is simpler for consumers but means existing Prometheus client libraries cannot parse the response directly. - -2. **`chunkCount` omitted**: Prometheus includes a `chunkCount` field (from `prometheus_tsdb_head_chunks`). In a distributed system with replication, chunk counts across ingesters cannot be meaningfully aggregated — chunks are an ingester-local storage detail, and summing/dividing by the replication factor does not produce a useful number. 
- -**Open question**: Should we adopt the `headStats` wrapper to maintain client compatibility with Prometheus tooling? The trade-off is compatibility vs simplicity — the flat format is easier to consume for Cortex-specific clients, but adopting the Prometheus format would allow reuse of existing client libraries. - -### Field Portability Between Sources - -Some fields are shared across both sources, while others are source-specific: - -| Field | `source=head` | `source=blocks` | Notes | -|---|---|---|---| -| `seriesCountByMetricName` | Yes | Yes | Core cardinality diagnostic | -| `labelValueCountByLabelName` | Yes | Yes | Core cardinality diagnostic | -| `seriesCountByLabelValuePair` | Yes | Yes | Core cardinality diagnostic | -| `numSeries` | Yes | Yes | Approximate for blocks due to overlap | -| `minTime` / `maxTime` | Yes | Yes | Head time range vs block time range | -| `memoryInBytesByLabelName` | Yes | No | In-memory byte usage, head-specific | -| `numLabelPairs` | Yes | No | `MemPostings`-specific count | - ### Multi-Tenancy Tenant isolation is enforced through the existing Cortex authentication middleware. The `X-Scope-OrgID` header identifies the tenant, and the ingester only returns statistics from that tenant's TSDB head. No cross-tenant data leakage is possible because each tenant has a separate TSDB instance in the ingester. 
@@ -381,5 +311,5 @@ The implementation spans the following key files:
 
 ### Shared
 
-- `docs/api/_index.md` - API documentation (updated with `source` parameter)
+- `docs/api/_index.md` - API documentation (updated with `source`, `start`, and `end` parameters)
 - `integration/api_endpoints_test.go` - Integration tests for both head and blocks paths

From 87fe63e87c4bf618933e7dc6c8e32e697dc827fc Mon Sep 17 00:00:00 2001
From: Charlie Le
Date: Fri, 13 Mar 2026 13:11:18 -0700
Subject: [PATCH 4/6] Require start/end for blocks path and add per-tenant max
 query range limit

Make start/end required for source=blocks to prevent unbounded block
scanning. Add cardinality_max_query_range per-tenant limit (default 24h)
to give operators control over the blast radius.

Co-Authored-By: Claude Opus 4.6 (1M context)
Signed-off-by: Charlie Le
---
 docs/proposals/per-tenant-tsdb-status-api.md | 21 ++++++++++++++------
 1 file changed, 15 insertions(+), 6 deletions(-)

diff --git a/docs/proposals/per-tenant-tsdb-status-api.md b/docs/proposals/per-tenant-tsdb-status-api.md
index a0036a49636..0d70a640568 100644
--- a/docs/proposals/per-tenant-tsdb-status-api.md
+++ b/docs/proposals/per-tenant-tsdb-status-api.md
@@ -43,12 +43,11 @@ GET /api/v1/cardinality?limit=N&source=head|blocks&start=T&end=T
 - **Query Parameters**:
   - `limit` (optional, default 10) - controls the number of top items returned per category.
   - `source` (optional, default `head`) - selects the data source. `head` queries ingester TSDB heads, `blocks` queries compacted blocks in long-term storage via store gateways.
-  - `start` (optional, RFC3339 or Unix timestamp) - start of the time range to analyze.
-  - `end` (optional, RFC3339 or Unix timestamp) - end of the time range to analyze.
+  - `start` (RFC3339 or Unix timestamp) - start of the time range to analyze. Required for `source=blocks`, optional for `source=head`.
+  - `end` (RFC3339 or Unix timestamp) - end of the time range to analyze.
Required for `source=blocks`, optional for `source=head`. - **Time Range Behavior**: - - **Blocks path**: Only blocks whose time range overlaps with `[start, end]` are analyzed. A block is included if its `minTime < end` and its `maxTime > start`. - - **Head path**: Head stats are included if the head's time range overlaps with `[start, end]`. The TSDB head does not support sub-range cardinality filtering, so when included, stats reflect the full head. - - When `start` and `end` are omitted, all available data is included. + - **Blocks path**: `start` and `end` are required. Only blocks whose time range overlaps with `[start, end]` are analyzed. A block is included if its `minTime < end` and its `maxTime > start`. The requested time range must not exceed the per-tenant `cardinality_max_query_range` limit (see [Per-Tenant Limits](#per-tenant-limits)). + - **Head path**: `start` and `end` are optional. When provided, head stats are included only if the head's time range overlaps with `[start, end]`. The TSDB head does not support sub-range cardinality filtering, so when included, stats reflect the full head. When omitted, head stats are always included. ### Architecture @@ -203,7 +202,7 @@ Results are cached per block (see [Caching](#caching)). #### Block Selection -When the `start` and `end` query parameters are provided, the store gateway filters blocks based on time range overlap: a block is included only if its `minTime < end` and its `maxTime > start`. This allows users to scope cardinality analysis to a specific time window, such as the last 24 hours or a particular incident period. When `start` and `end` are omitted, all blocks for the tenant are included. +The store gateway filters blocks based on the required `start` and `end` parameters: a block is included only if its `minTime < end` and its `maxTime > start`. This scopes cardinality analysis to a specific time window, such as the last 24 hours or a particular incident period. 
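The inclusion rule above can be sketched as a small predicate. This is a minimal sketch assuming the strict comparisons stated in the proposal; `blockOverlaps` is an illustrative name, not an identifier from the implementation.

```go
package main

import "fmt"

// blockOverlaps reports whether a block's [minTime, maxTime] range overlaps
// the requested [start, end] window, using the rule from the proposal:
// minTime < end && maxTime > start. Timestamps are Unix milliseconds.
func blockOverlaps(minTime, maxTime, start, end int64) bool {
	return minTime < end && maxTime > start
}

func main() {
	// A 2h block whose second half falls inside the query window: included.
	fmt.Println(blockOverlaps(1704067200000, 1704074400000, 1704070800000, 1704078000000))
	// A block entirely before the window: excluded.
	fmt.Println(blockOverlaps(0, 1000, 2000, 3000))
}
```

Note the comparisons are strict, so a block that merely touches the window boundary (its `maxTime` equal to `start`) is excluded.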
#### Caching @@ -237,6 +236,16 @@ The JSON response uses a flat structure. Both `source=head` and `source=blocks` Tenant isolation is enforced through the existing Cortex authentication middleware. The `X-Scope-OrgID` header identifies the tenant, and the ingester only returns statistics from that tenant's TSDB head. No cross-tenant data leakage is possible because each tenant has a separate TSDB instance in the ingester. +### Per-Tenant Limits + +To prevent expensive cardinality queries from overloading store gateways, the following per-tenant runtime-configurable limit is introduced: + +| Limit | Flag | YAML | Default | Description | +|---|---|---|---|---| +| `cardinality_max_query_range` | `-querier.cardinality-max-query-range` | `cardinality_max_query_range` | `24h` | Maximum allowed time range (`end - start`) for `source=blocks` cardinality queries. | + +When a `source=blocks` request exceeds this limit, the endpoint returns HTTP 422 with an error message indicating the maximum allowed range. This gives operators control over the blast radius per tenant — high-value tenants can be granted wider windows while keeping a safe default for everyone else. 
+
 ## Design Alternatives
 
 ### Distributor vs Querier Routing (Head Path)

From c4bac7f1cf0895f0329fecc08d76b014f3a74558 Mon Sep 17 00:00:00 2001
From: Charlie Le
Date: Fri, 13 Mar 2026 13:38:14 -0700
Subject: [PATCH 5/6] Address all review findings from proposal review

Critical:
- Fix blocks path aggregation: no SG RF division since GetClientsFor
  routes each block to exactly one store gateway

Significant:
- Add min_time, max_time, block_ids to store gateway CardinalityRequest
- Specify MaxErrors=0 for head path with availability implications
- Add consistency check and retry logic for blocks path
- Document RF division as best-effort approximation

Moderate:
- Wrap responses in standard {status, data} Prometheus envelope
- Change HTTP 422 to HTTP 400 for limit violations
- Add Error Responses section with all validation scenarios
- Add approximated field for block overlap and partial results
- Add Observability section with metrics
- Add per-tenant concurrency limit and query timeout
- Reject start/end for source=head instead of silently ignoring

Low:
- Add Rollout Plan with phased approach and feature flag
- Document rolling upgrade compatibility (Unimplemented handling)
- Document Query Frontend bypass
- Improve caching: full results keyed by ULID, limit at response time
- Add missing files to implementation section
- Move shared proto to pkg/cortexpb/cardinality.proto
- Rename TSDBStatus* to Cardinality* throughout
- Add limit upper bound (max 512)

Co-Authored-By: Claude Opus 4.6 (1M context)
Signed-off-by: Charlie Le
---
 docs/proposals/per-tenant-tsdb-status-api.md | 245 +++++++++++++------
 1 file changed, 164 insertions(+), 81 deletions(-)

diff --git a/docs/proposals/per-tenant-tsdb-status-api.md b/docs/proposals/per-tenant-tsdb-status-api.md
index 0d70a640568..ec94027cbe9 100644
--- a/docs/proposals/per-tenant-tsdb-status-api.md
+++ b/docs/proposals/per-tenant-tsdb-status-api.md
@@ -23,7 +23,7 @@ Expose per-tenant cardinality statistics via a REST API endpoint
on the Cortex q 1. Support two data sources: in-memory TSDB head data from ingesters and compacted blocks from long-term object storage via store gateways. 2. Aggregate statistics across all ingesters or store gateways that hold data for the requesting tenant. -3. Correctly account for replication factor when summing series counts. +3. Correctly account for replication factor when summing series counts from ingesters. 4. Respect multi-tenancy, ensuring tenants can only see their own data. ## Out of Scope @@ -36,22 +36,45 @@ Expose per-tenant cardinality statistics via a REST API endpoint on the Cortex q ### Endpoint ``` -GET /api/v1/cardinality?limit=N&source=head|blocks&start=T&end=T +GET /api/v1/cardinality?limit=N&source=head|blocks&start=T&end=T ``` +The endpoint is also registered at `/api/v1/cardinality`, following the pattern used by other querier endpoints. + - **Authentication**: Requires `X-Scope-OrgID` header (standard Cortex tenant authentication). - **Query Parameters**: - - `limit` (optional, default 10) - controls the number of top items returned per category. - - `source` (optional, default `head`) - selects the data source. `head` queries ingester TSDB heads, `blocks` queries compacted blocks in long-term storage via store gateways. - - `start` (RFC3339 or Unix timestamp) - start of the time range to analyze. Required for `source=blocks`, optional for `source=head`. - - `end` (RFC3339 or Unix timestamp) - end of the time range to analyze. Required for `source=blocks`, optional for `source=head`. -- **Time Range Behavior**: - - **Blocks path**: `start` and `end` are required. Only blocks whose time range overlaps with `[start, end]` are analyzed. A block is included if its `minTime < end` and its `maxTime > start`. The requested time range must not exceed the per-tenant `cardinality_max_query_range` limit (see [Per-Tenant Limits](#per-tenant-limits)). - - **Head path**: `start` and `end` are optional. 
When provided, head stats are included only if the head's time range overlaps with `[start, end]`. The TSDB head does not support sub-range cardinality filtering, so when included, stats reflect the full head. When omitted, head stats are always included. + - `limit` (optional, default 10, max 512) - controls the number of top items returned per category. Values outside the `[1, 512]` range are rejected with HTTP 400. + - `source` (optional, default `head`) - selects the data source. `head` queries ingester TSDB heads, `blocks` queries compacted blocks in long-term storage via store gateways. Invalid values are rejected with HTTP 400. + - `start` (RFC3339 or Unix timestamp) - start of the time range to analyze. Required for `source=blocks`, not accepted for `source=head`. + - `end` (RFC3339 or Unix timestamp) - end of the time range to analyze. Required for `source=blocks`, not accepted for `source=head`. +- **Time Range Behavior** (`source=blocks` only): + - `start` and `end` are required. Only blocks whose time range overlaps with `[start, end]` are analyzed. A block is included if its `minTime < end` and its `maxTime > start`. + - The requested time range (`end - start`) must not exceed the per-tenant `cardinality_max_query_range` limit (see [Per-Tenant Limits](#per-tenant-limits)). + - `start` must be before `end`; inverted ranges are rejected with HTTP 400. + +The head path does not accept `start`/`end` because the TSDB head cannot filter cardinality statistics by sub-range — it always returns stats for the full head. Rather than accept parameters that cannot be honored, the endpoint rejects them with HTTP 400. 
+ +### Error Responses + +All error responses use the standard Prometheus API envelope: + +```json +{"status": "error", "errorType": "bad_data", "error": "description of the problem"} +``` + +| Condition | HTTP Status | Error Message | +|---|---|---| +| Invalid `limit` (< 1, > 512, or non-integer) | 400 | `invalid limit: must be an integer between 1 and 512` | +| Invalid `source` (not `head` or `blocks`) | 400 | `invalid source: must be "head" or "blocks"` | +| `start`/`end` provided with `source=head` | 400 | `start and end parameters are not supported for source=head` | +| `start` or `end` missing with `source=blocks` | 400 | `start and end are required for source=blocks` | +| Malformed `start` or `end` | 400 | `invalid start/end: must be RFC3339 or Unix timestamp` | +| `start >= end` | 400 | `invalid time range: start must be before end` | +| Time range exceeds `cardinality_max_query_range` | 400 | `the query time range exceeds the limit (query length: %s, limit: %s)` | ### Architecture -The HTTP handler parses the `source` parameter and delegates to the appropriate backend. +The HTTP handler parses the `source` parameter and delegates to the appropriate backend. The endpoint is registered via `NewQuerierHandler` in `pkg/api/handlers.go` and does **not** go through the Query Frontend — it is served directly by the Querier. The Query Frontend's splitting, caching, and retry logic is designed for PromQL queries and does not apply to cardinality statistics. The Querier's own per-tenant concurrency limit provides sufficient request control (see [Per-Tenant Limits](#per-tenant-limits)). #### Head Path (`source=head`) @@ -61,82 +84,91 @@ The request flows through the Querier's HTTP handler, which delegates to the in- Client → HTTP Handler (Querier) → In-process Distributor → gRPC Fan-out (Ingesters) → Aggregation (Distributor) → JSON Response ``` -1. 
**HTTP Handler** (`TSDBStatusHandler` in `pkg/querier/tsdb_status_handler.go`): Registered via `NewQuerierHandler` in `pkg/api/handlers.go`. Parses the `limit` query parameter and calls the distributor's `TSDBStatus` method. -2. **Distributor Fan-out** (`TSDBStatus` in `pkg/distributor/distributor.go`): The Querier process holds an in-process Distributor instance (initialized via the `DistributorService` module). This instance uses `GetIngestersForMetadata` to discover all ingesters for the tenant, then sends a `TSDBStatusRequest` gRPC call to each ingester in the replication set. -3. **Ingester** (`TSDBStatus` in `pkg/ingester/ingester.go`): Retrieves the tenant's TSDB head and calls `db.Head().Stats(labels.MetricName, limit)` to get cardinality statistics from the Prometheus TSDB library. +1. **HTTP Handler** (`CardinalityHandler` in `pkg/querier/cardinality_handler.go`): Registered via `NewQuerierHandler` in `pkg/api/handlers.go`. Parses the `limit` query parameter and calls the distributor's `Cardinality` method. +2. **Distributor Fan-out** (`Cardinality` in `pkg/distributor/distributor.go`): The Querier process holds an in-process Distributor instance (initialized via the `DistributorService` module). This instance uses `GetIngestersForMetadata` to discover all ingesters for the tenant, then sends a `CardinalityRequest` gRPC call to each ingester in the replication set with `MaxErrors = 0` — all ingesters must respond for the RF-based aggregation to be accurate. If any ingester in the tenant's replication set is unavailable, the request fails. +3. **Ingester** (`Cardinality` in `pkg/ingester/ingester.go`): Retrieves the tenant's TSDB head and calls `db.Head().Stats(labels.MetricName, limit)` to get cardinality statistics from the Prometheus TSDB library. 4. **Aggregation**: The distributor merges responses from all ingesters and returns the combined result. 
+**Rolling upgrade compatibility**: During rolling deployments, some ingesters may not yet support the `Cardinality` RPC. The distributor treats `Unimplemented` gRPC errors from old ingesters the same as any other error — since `MaxErrors = 0`, the request fails with an HTTP 500 indicating that not all ingesters support the cardinality API. This is acceptable during the upgrade window. + #### Blocks Path (`source=blocks`) -The request flows through the Querier's HTTP handler, which fans out to store gateways: +The request flows through the Querier's HTTP handler, which fans out to store gateways using the same `BlocksFinder` + `GetClientsFor` pattern used by `LabelNames`, `LabelValues`, and `Series`: ``` -Client → HTTP Handler (Querier) → gRPC Fan-out (Store Gateways) → Per-Tenant Block Index Analysis → Aggregation (Querier) → JSON Response +Client → HTTP Handler (Querier) → BlocksFinder (discover blocks) → GetClientsFor (route blocks to SGs) → gRPC Fan-out → Aggregation (Querier) → JSON Response ``` -1. **HTTP Handler** (`TSDBStatusHandler` in `pkg/querier/tsdb_status_handler.go`): Parses `limit` and `source=blocks`, then calls the blocks store's `TSDBStatus` method. -2. **Store Gateway Fan-out**: The Querier uses its existing store gateway client pool (`BlocksStoreSet`) to discover store gateways that hold blocks for the tenant, then sends a `TSDBStatus` gRPC call to each relevant store gateway instance. -3. **Store Gateway** (`TSDBStatus` in `pkg/storegateway/gateway.go`): Locates the tenant's `BucketStore`, iterates over the tenant's loaded blocks, and computes cardinality statistics from block indexes (see [Block Index Cardinality Computation](#block-index-cardinality-computation)). -4. **Aggregation**: The querier merges responses from all store gateways and returns the combined result. +1. 
**HTTP Handler** (`CardinalityHandler` in `pkg/querier/cardinality_handler.go`): Parses `limit`, `start`, `end`, and `source=blocks`, then calls the blocks store's `Cardinality` method. +2. **Block Discovery**: The Querier uses `BlocksFinder.GetBlocks()` to discover all blocks for the tenant within the `[start, end]` time range, then calls `GetClientsFor()` to route each block to exactly one store gateway instance. Each block is sent to a single store gateway — there is no broadcast to all replicas. +3. **Store Gateway** (`Cardinality` in `pkg/storegateway/gateway.go`): Receives a request with specific block IDs. Locates the tenant's `BucketStore`, iterates over the specified blocks, and computes cardinality statistics from block indexes (see [Block Index Cardinality Computation](#block-index-cardinality-computation)). +4. **Consistency Check**: After receiving responses, the querier runs `BlocksConsistencyChecker.Check()` to detect missing blocks. If blocks are missing (e.g., a store gateway hasn't loaded a recently uploaded block), the querier retries those blocks on different store gateway replicas, up to 3 attempts. If blocks remain missing after all retries, the response includes partial results. +5. **Aggregation**: The querier merges responses from all store gateways and returns the combined result. + +**Rolling upgrade compatibility**: During rolling deployments, store gateways that do not yet support the `Cardinality` RPC return `Unimplemented` errors. The querier retries affected blocks on other replicas. If no replica supports the RPC, those blocks are treated as missing and the response is partial. 
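The missing-block detection that drives the retry could look roughly like this. It is a simplified, hypothetical stand-in for `BlocksConsistencyChecker.Check()`: block IDs are plain strings here instead of ULIDs, and the retry bookkeeping around it is omitted.

```go
package main

import "fmt"

// missingBlocks returns the block IDs that were expected for the query but
// absent from store gateway responses — the condition that triggers a retry
// of those blocks on a different store gateway replica.
func missingBlocks(expected []string, queried map[string]bool) []string {
	var missing []string
	for _, id := range expected {
		if !queried[id] {
			missing = append(missing, id)
		}
	}
	return missing
}

func main() {
	expected := []string{"block-a", "block-b", "block-c"}
	queried := map[string]bool{"block-a": true, "block-c": true}
	// block-b was not covered by any response, so it would be retried
	// (up to 3 attempts); if still missing, the result is partial.
	fmt.Println(missingBlocks(expected, queried))
}
```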
### gRPC Definition +Shared protobuf messages are defined in `pkg/cortexpb/cardinality.proto` and imported by both the ingester and store gateway protos: + +```protobuf +// pkg/cortexpb/cardinality.proto + +message CardinalityStatItem { + string name = 1; + uint64 value = 2; +} +``` + #### Ingester Service -A new `TSDBStatus` RPC is added to the Ingester service in `pkg/ingester/client/ingester.proto`: +A new `Cardinality` RPC is added to the Ingester service in `pkg/ingester/client/ingester.proto`: ```protobuf -rpc TSDBStatus(TSDBStatusRequest) returns (TSDBStatusResponse) {}; +rpc Cardinality(CardinalityRequest) returns (CardinalityResponse) {}; -message TSDBStatusRequest { +message CardinalityRequest { int32 limit = 1; } -message TSDBStatusResponse { +message CardinalityResponse { uint64 num_series = 1; - repeated TSDBStatItem series_count_by_metric_name = 2; - repeated TSDBStatItem label_value_count_by_label_name = 3; - repeated TSDBStatItem series_count_by_label_value_pair = 4; -} - -message TSDBStatItem { - string name = 1; - uint64 value = 2; + repeated cortexpb.CardinalityStatItem series_count_by_metric_name = 2; + repeated cortexpb.CardinalityStatItem label_value_count_by_label_name = 3; + repeated cortexpb.CardinalityStatItem series_count_by_label_value_pair = 4; } ``` #### Store Gateway Service -A new `TSDBStatus` RPC is added to the StoreGateway service in `pkg/storegateway/storegatewaypb/gateway.proto`: +A new `Cardinality` RPC is added to the StoreGateway service in `pkg/storegateway/storegatewaypb/gateway.proto`: ```protobuf -rpc TSDBStatus(TSDBStatusRequest) returns (TSDBStatusResponse) {}; +rpc Cardinality(CardinalityRequest) returns (CardinalityResponse) {}; -message TSDBStatusRequest { +message CardinalityRequest { int32 limit = 1; + int64 min_time = 2; + int64 max_time = 3; + repeated bytes block_ids = 4; } -message TSDBStatusResponse { +message CardinalityResponse { uint64 num_series = 1; - repeated TSDBStatItem series_count_by_metric_name = 2; 
- repeated TSDBStatItem label_value_count_by_label_name = 3; - repeated TSDBStatItem series_count_by_label_value_pair = 4; -} - -message TSDBStatItem { - string name = 1; - uint64 value = 2; + repeated cortexpb.CardinalityStatItem series_count_by_metric_name = 2; + repeated cortexpb.CardinalityStatItem label_value_count_by_label_name = 3; + repeated cortexpb.CardinalityStatItem series_count_by_label_value_pair = 4; } ``` -Both the ingester and store gateway response messages share the same fields. +The store gateway `CardinalityRequest` includes `min_time`, `max_time`, and `block_ids` so the store gateway can filter blocks server-side. This matches the pattern used by other store gateway RPCs (`SeriesRequest`, `LabelNamesRequest`) where the querier routes specific blocks to specific store gateways. ### Aggregation Logic -Because each series is replicated across multiple ingesters (controlled by the replication factor), the aggregation logic must account for this when merging responses: - #### Head Path Aggregation +Because each series is replicated across multiple ingesters (controlled by the replication factor), the aggregation logic divides by the RF when merging responses. All ingesters must respond (`MaxErrors = 0`) for the RF division to be accurate. + | Field | Aggregation Strategy | |---|---| | `numSeries` | Sum across ingesters, divide by replication factor | @@ -146,18 +178,20 @@ Because each series is replicated across multiple ingesters (controlled by the r The `topNStats` helper function handles the sort-and-truncate step: it divides values by the replication factor, sorts descending by value, and returns the top N items. +**Note on approximation**: The RF division is a best-effort approximation, matching the approach used by `UserStats`. It can undercount when ingesters are in non-ACTIVE states during ring changes, or when shuffle sharding with a lookback period causes uneven distribution. 
This is acceptable for a diagnostic endpoint — the goal is to identify the largest cardinality contributors, not to produce exact totals. + #### Blocks Path Aggregation -Store gateways use the store gateway ring for replication, so different store gateways may serve the same blocks. The aggregation handles this differently from ingesters: +The blocks path uses `GetClientsFor` to route each block to exactly one store gateway instance. Since there is no broadcast to all replicas, **no RF division is applied** — each block's statistics are returned exactly once. | Field | Aggregation Strategy | |---|---| -| `numSeries` | Sum across store gateways, divide by store gateway replication factor | -| `seriesCountByMetricName` | Sum per metric, divide by SG RF, return top N | +| `numSeries` | Sum across store gateways (no RF division) | +| `seriesCountByMetricName` | Sum per metric, return top N | | `labelValueCountByLabelName` | Maximum per label | -| `seriesCountByLabelValuePair` | Sum per pair, divide by SG RF, return top N | +| `seriesCountByLabelValuePair` | Sum per pair, return top N | -**Note on block overlap**: Before compaction completes, a tenant may have multiple blocks covering the same time range. Series that appear in overlapping blocks within a single store gateway are counted once per block they appear in, so the `numSeries` total may overcount compared to the true unique series count. This is an acceptable approximation — the primary use case is identifying which metrics and label-value pairs contribute the most cardinality, not producing an exact total. +**Note on block overlap**: Before compaction completes, a tenant may have multiple blocks covering the same time range. Series that appear in overlapping blocks are counted once per block they appear in, so the `numSeries` total may overcount compared to the true unique series count. 
In practice, with RF ingesters each uploading 2-hour blocks, a 24-hour query range before compaction could have up to `RF * 12` overlapping source blocks, making `numSeries` up to `RF` times the true value. The response includes an `approximated` field set to `true` when overlapping blocks are detected, so consumers know the results may be inflated. The top-N rankings remain useful regardless of overlap — the relative ordering of cardinality contributors is preserved. ### Block Index Cardinality Computation @@ -196,7 +230,7 @@ This requires expanding posting lists for every label=value combination, which i **Recommendation**: This field is computed on-demand using Option A (posting list expansion). To bound the cost: - Only expand posting lists for the top N label names by value count (already known from step 1). - Within each label name, only expand posting lists for a bounded number of values. -- Apply a per-request timeout so that very high-cardinality tenants get partial results rather than unbounded computation. +- Apply the per-tenant `cardinality_query_timeout` (default 60s) so that very high-cardinality tenants get partial results rather than unbounded computation. When a timeout occurs, the response includes whatever results were computed before the deadline, with the `approximated` field set to `true`. Results are cached per block (see [Caching](#caching)). @@ -206,45 +240,89 @@ The store gateway filters blocks based on the required `start` and `end` paramet #### Caching -Compacted blocks are immutable — once a block is written to object storage, its contents never change. This means cardinality statistics computed from a block's index can be cached indefinitely (until the block is deleted by the compactor). Each store gateway maintains a per-block cardinality cache keyed by `(block ULID, limit)`. This cache eliminates redundant index traversals when the endpoint is called repeatedly.
+Compacted blocks are immutable — once a block is written to object storage, its contents never change. This means cardinality statistics computed from a block's index can be cached indefinitely (until the block is deleted by the compactor). + +Each store gateway maintains a per-block cardinality cache. The cache stores the full (unlimited) result for each block, keyed by block ULID. The `limit` (top-N truncation) is applied at response time from the cached full result. This maximizes cache hit rates — a `limit=10` request followed by a `limit=20` request reuses the same cache entry. -The cache is populated on first request and invalidated when blocks are removed during compaction syncs. +The cache is populated on first request and invalidated when blocks are removed during compaction syncs. A safety TTL of 24 hours is applied as defense-in-depth for entries that outlive their block's deletion. + +**Cache hit/miss metrics**: The cache exposes `cortex_cardinality_cache_hits_total` and `cortex_cardinality_cache_misses_total` counters per store gateway (see [Observability](#observability)). ### Response Format -The JSON response uses a flat structure. Both `source=head` and `source=blocks` return the same fields: +The JSON response uses the standard Prometheus API envelope.
Both `source=head` and `source=blocks` return the same data fields: ```json { - "numSeries": 1500, - "seriesCountByMetricName": [ - {"name": "http_requests_total", "value": 500}, - {"name": "process_cpu_seconds_total", "value": 200} - ], - "labelValueCountByLabelName": [ - {"name": "instance", "value": 50}, - {"name": "job", "value": 10} - ], - "seriesCountByLabelValuePair": [ - {"name": "job=api-server", "value": 300}, - {"name": "instance=host1:9090", "value": 150} - ] + "status": "success", + "data": { + "numSeries": 1500, + "approximated": false, + "seriesCountByMetricName": [ + {"name": "http_requests_total", "value": 500}, + {"name": "process_cpu_seconds_total", "value": 200} + ], + "labelValueCountByLabelName": [ + {"name": "instance", "value": 50}, + {"name": "job", "value": 10} + ], + "seriesCountByLabelValuePair": [ + {"name": "job=api-server", "value": 300}, + {"name": "instance=host1:9090", "value": 150} + ] + } } ``` +| Field | Description | +|---|---| +| `numSeries` | Total number of series (approximate — see notes on RF division and block overlap). | +| `approximated` | `true` when results may be inflated due to overlapping blocks, partial timeout, or missing blocks after consistency check retries. `false` when results are exact. | +| `seriesCountByMetricName` | Top N metrics by series count. | +| `labelValueCountByLabelName` | Top N label names by number of distinct values. | +| `seriesCountByLabelValuePair` | Top N label=value pairs by series count. | + ### Multi-Tenancy Tenant isolation is enforced through the existing Cortex authentication middleware. The `X-Scope-OrgID` header identifies the tenant, and the ingester only returns statistics from that tenant's TSDB head. No cross-tenant data leakage is possible because each tenant has a separate TSDB instance in the ingester. 
### Per-Tenant Limits -To prevent expensive cardinality queries from overloading store gateways, the following per-tenant runtime-configurable limit is introduced: +To prevent expensive cardinality queries from overloading the system, the following per-tenant runtime-configurable limits are introduced: | Limit | Flag | YAML | Default | Description | |---|---|---|---|---| +| `cardinality_api_enabled` | `-querier.cardinality-api-enabled` | `cardinality_api_enabled` | `false` | Enables the cardinality API for this tenant. When disabled, the endpoint returns HTTP 403. | | `cardinality_max_query_range` | `-querier.cardinality-max-query-range` | `cardinality_max_query_range` | `24h` | Maximum allowed time range (`end - start`) for `source=blocks` cardinality queries. | +| `cardinality_max_concurrent_requests` | `-querier.cardinality-max-concurrent-requests` | `cardinality_max_concurrent_requests` | `2` | Maximum number of concurrent cardinality requests per tenant. Excess requests are rejected with HTTP 429. | +| `cardinality_query_timeout` | `-querier.cardinality-query-timeout` | `cardinality_query_timeout` | `60s` | Per-request timeout for cardinality computation. On timeout, partial results are returned with `approximated: true`. | + +When a `source=blocks` request exceeds the `cardinality_max_query_range` limit, the endpoint returns HTTP 400 with an error message following the pattern used by `max_query_length` violations: `"the query time range exceeds the limit (query length: %s, limit: %s)"`. + +### Observability + +The cardinality endpoint exposes the following metrics: + +| Metric | Type | Labels | Description | +|---|---|---|---| +| `cortex_cardinality_request_duration_seconds` | Histogram | `source`, `status_code` | End-to-end request duration. | +| `cortex_cardinality_requests_total` | Counter | `source`, `status_code` | Total requests by source and result. 
| +| `cortex_cardinality_inflight_requests` | Gauge | `source` | Current number of in-flight cardinality requests. | +| `cortex_cardinality_cache_hits_total` | Counter | — | Per-block cardinality cache hits (store gateway). | +| `cortex_cardinality_cache_misses_total` | Counter | — | Per-block cardinality cache misses (store gateway). | +| `cortex_cardinality_blocks_queried_total` | Counter | — | Number of blocks analyzed per request (store gateway). | + +These metrics are registered with `promauto.With(reg)` following Cortex conventions — no global registerer is used. + +## Rollout Plan + +The cardinality API is introduced as an **experimental** feature behind the `cardinality_api_enabled` per-tenant flag (default `false`). + +**Phase 1 — Head path**: Implement the `source=head` path (ingester fan-out, RF-based aggregation). This is lower risk since it only queries in-memory TSDB heads with bounded data. Enable for a small set of tenants for validation. + +**Phase 2 — Blocks path**: Implement the `source=blocks` path (store gateway fan-out, block index analysis, caching). This is higher risk due to potential object storage I/O and larger data volumes. Enable selectively behind the same flag. -When a `source=blocks` request exceeds this limit, the endpoint returns HTTP 422 with an error message indicating the maximum allowed range. This gives operators control over the blast radius per tenant — high-value tenants can be granted wider windows while keeping a safe default for everyone else. +**Phase 3 — GA**: After validation, change `cardinality_api_enabled` default to `true` and graduate from experimental status. ## Design Alternatives @@ -255,7 +333,7 @@ This design routes the endpoint through the **Querier**, which handles the HTTP **Current approach (Querier):** - Provides logical separation — this is a read-only diagnostic endpoint and belongs on the read path alongside other query APIs. 
- Follows the pattern used by the `/api/v1/metadata` endpoint, which is registered via `NewQuerierHandler` and delegates to the Distributor's `MetricsMetadata` method. -- Requires adding `TSDBStatus` to the Querier's Distributor interface (`pkg/querier/distributor_queryable.go`) and a handler in the Querier package. +- Requires adding `Cardinality` to the Querier's Distributor interface (`pkg/querier/distributor_queryable.go`) and a handler in the Querier package. **Alternative (Distributor):** - Follows the pattern used by the `UserStats` endpoint, which is registered directly on the Distributor. @@ -303,22 +381,27 @@ The implementation spans the following key files: ### Head Path (Ingester) -- `pkg/api/handlers.go` - Route registration in `NewQuerierHandler` -- `pkg/querier/tsdb_status_handler.go` - HTTP handler (`TSDBStatusHandler`) -- `pkg/querier/distributor_queryable.go` - `TSDBStatus` added to the Distributor interface -- `pkg/distributor/distributor.go` - Fan-out to ingesters and aggregation logic (`TSDBStatus`, `topNStats`) -- `pkg/ingester/ingester.go` - Per-tenant TSDB head stats retrieval (`TSDBStatus`, `statsToPB`) -- `pkg/ingester/client/ingester.proto` - gRPC message definitions (`TSDBStatusRequest`, `TSDBStatusResponse`, `TSDBStatItem`) +- `pkg/api/handlers.go` - Route registration in `NewQuerierHandler` (both prometheus and legacy prefixes) +- `pkg/querier/cardinality_handler.go` - HTTP handler (`CardinalityHandler`) +- `pkg/querier/cardinality_handler_test.go` - Handler unit tests +- `pkg/querier/distributor_queryable.go` - `Cardinality` added to the Distributor interface +- `pkg/distributor/distributor.go` - Fan-out to ingesters and aggregation logic (`Cardinality`, `topNStats`) +- `pkg/distributor/distributor_test.go` - Aggregation unit tests +- `pkg/ingester/ingester.go` - Per-tenant TSDB head stats retrieval (`Cardinality`, `statsToPB`) +- `pkg/ingester/client/ingester.proto` - gRPC message definitions (`CardinalityRequest`, 
`CardinalityResponse`) ### Blocks Path (Store Gateway) -- `pkg/querier/tsdb_status_handler.go` - HTTP handler routes `source=blocks` to store gateway path -- `pkg/querier/blocks_store_queryable.go` - `TSDBStatus` added to the store gateway query interface -- `pkg/storegateway/storegatewaypb/gateway.proto` - gRPC message definitions for store gateway `TSDBStatus` RPC -- `pkg/storegateway/gateway.go` - Store gateway `TSDBStatus` handler, delegates to `ThanosBucketStores` +- `pkg/querier/cardinality_handler.go` - HTTP handler routes `source=blocks` to store gateway path +- `pkg/querier/blocks_store_queryable.go` - `Cardinality` added to the store gateway query interface +- `pkg/storegateway/storegatewaypb/gateway.proto` - gRPC message definitions for store gateway `Cardinality` RPC +- `pkg/storegateway/gateway.go` - Store gateway `Cardinality` handler, delegates to `ThanosBucketStores` - `pkg/storegateway/bucket_stores.go` - Per-tenant block iteration and cardinality computation from index headers and block indexes ### Shared +- `pkg/cortexpb/cardinality.proto` - Shared `CardinalityStatItem` message definition +- `pkg/util/validation/limits.go` - Per-tenant limit definitions (`cardinality_api_enabled`, `cardinality_max_query_range`, `cardinality_max_concurrent_requests`, `cardinality_query_timeout`) +- `pkg/util/validation/exporter.go` - Overrides exporter for new limits - `docs/api/_index.md` - API documentation (updated with `source`, `start`, and `end` parameters) - `integration/api_endpoints_test.go` - Integration tests for both head and blocks paths From 8199b43fb3df4327da0564b9462455b3fa9c593e Mon Sep 17 00:00:00 2001 From: Charlie Le Date: Fri, 13 Mar 2026 13:51:27 -0700 Subject: [PATCH 6/6] Rename proposal file to per-tenant-cardinality-api.md Co-Authored-By: Claude Opus 4.6 (1M context) Signed-off-by: Charlie Le --- ...er-tenant-tsdb-status-api.md => per-tenant-cardinality-api.md} | 0 1 file changed, 0 insertions(+), 0 deletions(-) rename 
docs/proposals/{per-tenant-tsdb-status-api.md => per-tenant-cardinality-api.md} (100%) diff --git a/docs/proposals/per-tenant-tsdb-status-api.md b/docs/proposals/per-tenant-cardinality-api.md similarity index 100% rename from docs/proposals/per-tenant-tsdb-status-api.md rename to docs/proposals/per-tenant-cardinality-api.md