Draft
103 commits
0e35e5f
feat: agentic benchmark ingest + UI with offload-mode halo
cquil11 Apr 23, 2026
9c43a76
fix: agentic offload variants — render both halos + map renamed fields
cquil11 May 1, 2026
07ba106
fix: render offload halo on every offload-on point, not just frontier
cquil11 May 1, 2026
95e9dc7
fix: strip runner-pool suffix (-p1, -p2, ...) from hw identifier
cquil11 May 1, 2026
982106d
feat: bold scatter labels with concurrency tag + collision avoidance
cquil11 May 1, 2026
9572b95
fix: stack multi-line point labels upward so they don't overlap the p…
cquil11 May 1, 2026
37eecc6
fix: anchor multi-line labels via first tspan + tspan-aware collision…
cquil11 May 1, 2026
f317377
fix: dedupe artifacts by logical name + skip 0-successful agg rows
cquil11 May 1, 2026
52d35ba
Merge remote-tracking branch 'origin/master' into feat/agentx
cquil11 May 1, 2026
c2f66f6
feat: add AIPerf to FRAMEWORK_LABELS
cquil11 May 7, 2026
024797a
fix(changelog): coerce ids to string when filtering changelog by run
cquil11 May 12, 2026
aa15419
feat: default sequence to Agentic Traces when available
cquil11 May 12, 2026
cb4e87c
Merge remote-tracking branch 'origin/master' into feat/agentx
cquil11 May 14, 2026
099a33e
fix(agentic): respect percentile selector for input-throughput x axis
cquil11 May 15, 2026
50a06d1
fix(agentic): default percentile to p99 and drop median option
cquil11 May 15, 2026
25305dc
Merge remote-tracking branch 'origin/master' into feat/agentx
cquil11 May 15, 2026
3c96e91
fix(agentic): keep only p90 as the percentile option
cquil11 May 15, 2026
642081a
fix(agentic): default percentile to p90, surface only p90/p99
functionstackx May 15, 2026
3f45f4d
fix(agentic): drop p99 + median TTFT, p90 only across selectors
functionstackx May 15, 2026
03c775a
fix(agentic): honor e2e TTFT override in agentic mode too
functionstackx May 15, 2026
49f2b27
fix(agentic): default e2e chart x-axis to p90 TTFT
functionstackx May 15, 2026
9e2c532
fix(tooltip): cap data-point numeric values at 3 decimal places
cquil11 May 15, 2026
50ed25f
fix(agentic): relabel x-axis title for natural-x case too
cquil11 May 15, 2026
e9d8e3f
fix(agentic): include percentile word in chart heading
cquil11 May 15, 2026
2046282
fix(agentic): include percentile in e2e chart heading dropdown
cquil11 May 15, 2026
9957f19
feat(agentic): per-point trace_replay storage + detail page POC
cquil11 May 20, 2026
0067bfc
feat(agentic): hover crosshair + expand-to-dialog on detail charts
cquil11 May 21, 2026
1d502ac
feat(inference): one chart with TTFT / E2E / Interactivity x-axis picker
cquil11 May 21, 2026
965c862
fix(inference): TTFT/E2E pick metric by sequence kind + add P75 option
cquil11 May 21, 2026
e4d97f2
feat(metrics): wire P75/P95 through frontend + register new aiperf keys
cquil11 May 21, 2026
a7a1354
fix(inference): don't drop agentic TTFT points over 60s as outliers
cquil11 May 21, 2026
07194de
fix(trace-histograms): chunk DB query + blob-cache to escape size caps
cquil11 May 21, 2026
a1e594b
feat(inference): run selector actually filters chart data
cquil11 May 21, 2026
b0d228a
feat(inference): Session Time + Prefill TPS x-axis (live from trace b…
cquil11 May 21, 2026
8af1f5c
fix(inference): show Mean Normalized Session Time in minutes
functionstackx May 21, 2026
be34e97
fix(inference): use global P90 of per-turn prefill TPS/user
functionstackx May 21, 2026
c774c00
fix(inference): no-data flash on session-time / prefill-tps modes
functionstackx May 21, 2026
d5dbda7
feat(agentic-detail): aggregates-across-configs view
cquil11 May 21, 2026
41ef33b
fix(agentic-aggregates): metric name + stream-parse oversized blobs
cquil11 May 21, 2026
1cedd24
feat(agentic-aggregates): pre-compute stats at ingest time
cquil11 May 21, 2026
9d9c7c1
fix(agentic-aggregates): drop .js extension on app-route-traced import
cquil11 May 21, 2026
6063d01
feat(agentic-detail): pre-compute chart_series at ingest time
cquil11 May 21, 2026
24fe8fe
feat(agentic-detail): per-request Gantt timeline view
cquil11 May 22, 2026
f2618f4
fix(agentic-detail): aggregate vllm metrics across all engine series
cquil11 May 22, 2026
b3e315c
fix(scenario-selector): wrap "Deprecated" in SelectLabel + lead with …
cquil11 May 26, 2026
19b9958
fix(scenario-selector): wrap Deprecated header in SelectLabel only in…
cquil11 May 26, 2026
7114833
feat(agentic-detail): add cumulative input tokens chart
cquil11 May 27, 2026
c6697de
feat(agentic-detail): plot cumulative unique input tokens
cquil11 May 27, 2026
b5679bb
feat(request-timeline): expandable subagent -> stream rows
cquil11 May 27, 2026
2e1f1ce
fix(agentic-detail): make unique-input-tokens chart monotonic
cquil11 May 27, 2026
08bbe66
feat(agentic-detail): add unique input tokens in flight chart
cquil11 May 27, 2026
7561deb
feat(chart-series): extract SGLang metrics alongside vllm
cquil11 May 28, 2026
625d6e8
fix(ingest): derive GPU cache hit rate for SGLang at ingest time
cquil11 May 28, 2026
aa76e9e
feat(chart-series): map sglang:realtime_tokens to promptTokensBySource
cquil11 May 28, 2026
5872a3d
feat(chart-series): break out SGLang cache hits by cache_source
cquil11 May 28, 2026
94a3e8b
feat(chart-series): host cache util line + fix SGLang stacked-area co…
cquil11 May 28, 2026
93e197b
fix(stacked-area): align sources by timestamp before computing shares
cquil11 May 28, 2026
c14e19e
fix(ingest): split GPU vs CPU cache hit rate for SGLang hicache rows
cquil11 May 28, 2026
268617c
fix(ingest): recognize vLLM LMCache external_kv_transfer as CPU hit
cquil11 Jun 3, 2026
7fc6b4f
fix(scatter): use lightweight presence endpoint for View charts button
cquil11 Jun 4, 2026
80468eb
feat(chart-series): per-DP-rank KV cache utilization overlay
cquil11 Jun 4, 2026
3a5ef15
feat(scatter): restrict non-e2e xmodes to e2e-pareto points
cquil11 Jun 4, 2026
5035e17
fix(scatter): keep non-pareto points visible on non-e2e xmodes
cquil11 Jun 4, 2026
2bfea38
fix(scatter): scope e2e-pareto restriction to agentic only
cquil11 Jun 4, 2026
cbeeb69
feat(legend): info tooltip on Optimal Only for agentic non-e2e modes
cquil11 Jun 4, 2026
de5e51a
fix(inference): don't scope chart to one run when runs cover differen…
cquil11 Jun 4, 2026
72e1cbb
Merge remote-tracking branch 'origin/master' into feat/agentx
cquil11 Jun 9, 2026
af8766d
fix(inference): carry forward un-contested configs when a run is sele…
cquil11 Jun 11, 2026
ab5f4f9
fix(agentic): derive unique input tokens from prompt-source breakdown
cquil11 Jun 17, 2026
d6d3143
fix: reconcile agentic data after master merge
cquil11 Jun 17, 2026
f60ef9c
fix(gpu-compare): show concurrency (C=) over points
cquil11 Jun 17, 2026
22028cc
fix(agentic-timeline): hide no-op phase toggle; fixed-height scroll w…
cquil11 Jun 17, 2026
28d25a5
feat(agentic-timeline): sticky bottom h-scroll + double-click to rese…
cquil11 Jun 17, 2026
6e56bbf
fix(gpu-compare): show CPU-offload halo on points
cquil11 Jun 18, 2026
2c06009
fix(high-contrast): use full hue wheel for single-vendor comparisons
cquil11 Jun 18, 2026
68b35b7
Merge remote-tracking branch 'origin/master' into feat/agentx
cquil11 Jun 22, 2026
6275aa7
feat(inference): default line labels off, parallelism labels + high c…
cquil11 Jun 22, 2026
5c290a4
feat(agentic): use the chart's TP/EP/DEP/TEP parallelism labels on si…
cquil11 Jun 22, 2026
32adf6b
feat(agentic): sort dropdown for the sibling point navigator
cquil11 Jun 22, 2026
60c5c2d
feat(datasets): add 011 schema for datasets + dataset_conversations
cquil11 Jun 22, 2026
71e388f
feat(datasets): weka trace structure + cached-prefix builder
cquil11 Jun 22, 2026
9fbc716
feat(datasets): HF cc-traces-weka ingest script
cquil11 Jun 22, 2026
b6be5a8
fix(datasets): handle HF 429 rate-limiting in ingest
cquil11 Jun 22, 2026
a376b5b
feat(datasets): DB queries, API routes, and React Query hooks
cquil11 Jun 22, 2026
574dfcc
feat(datasets): /datasets pages, distribution cards, flamegraph, nav
cquil11 Jun 22, 2026
0c50139
docs(ingest): note the separate agentic-dataset ingest script
cquil11 Jun 22, 2026
2ae6eba
fix(datasets): flamegraph scroll box + dual-scale group bars
cquil11 Jun 22, 2026
c749f8f
feat(datasets): link request timeline to source-dataset conversation
cquil11 Jun 22, 2026
6b700a3
feat(datasets): deep-link request-timeline bar to the exact turn
cquil11 Jun 22, 2026
83fcd04
fix(datasets): visible turn highlight + pointer-tracking flamegraph t…
cquil11 Jun 22, 2026
3c40d31
fix(datasets): deep-link highlight fires on first navigation
cquil11 Jun 22, 2026
e460ea2
fix(high-contrast): stable line colors when deselecting legend items
cquil11 Jun 23, 2026
605bff7
merge origin/master into feat/agentx; resolve quick-filter/category-s…
adibarra Jun 23, 2026
a912eab
chore(security): bump dompurify override to >=3.4.11 (GHSA-cmwh-pvxp-…
adibarra Jun 23, 2026
ba6bc1c
test(e2e): align selector testid with scenario-selector rename; rewri…
adibarra Jun 23, 2026
ada19b5
test(datasets): component tests for distribution card, trace flamegra…
adibarra Jun 23, 2026
1c61ee3
refactor(datasets): extract shared compact() formatter, dedupe 5 loca…
adibarra Jun 23, 2026
e2e5424
refactor(db): squash agentic migrations into 007_agentic.sql so numbe…
adibarra Jun 23, 2026
772dfef
add agentic time-series and dataset timing
cquil11 Jun 23, 2026
13471d7
add dataset percentile distributions
cquil11 Jun 23, 2026
8bfe664
use cumulative percentiles for agentic charts
cquil11 Jun 23, 2026
e3e0bf4
fix(db): build each chart line from a single run, no cross-run/date s…
adibarra Jun 23, 2026
2c3bb6d
Default agentic charts to interactivity
cquil11 Jun 24, 2026
188 changes: 188 additions & 0 deletions .claude/agents/ingest.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,188 @@
---
name: ingest
description: Ingest a benchmark run from GitHub Actions into the Neon DB used by the feat/agentx deployment. The target DB write URL must be provided in the invocation. Handles standard ingest, delete+reingest, and changelog entries. Invoke when the user asks to ingest a workflow run URL.
tools: Bash, Read, Edit, Write
---

You ingest benchmark runs from `SemiAnalysisAI/InferenceX` GitHub Actions into the Neon branch used by the `feat/agentx` deployment of this dashboard. Operate on `/Users/quilicic/InferenceX-app`.

## Environment

- **Repo root**: `/Users/quilicic/InferenceX-app`
- **DB write URL — MUST be provided by the invoker.** There is no default: the target Neon branch changes over time, and ingesting into the wrong one silently corrupts a live deployment. If the prompt does not include a `postgresql://` write URL, STOP and ask for it before touching anything. Requirements:
  - Use the **direct (non-pooled)** host for ingest/migrations — no `-pooler` in the hostname.
  - For psql diagnostics you may use the same URL directly: `psql "$DATABASE_WRITE_URL" -c "..."`.
- **Local dev server**: usually `http://localhost:3002` (port 3000 is a different project on this machine — never purge port 3000)
- **Preview URL**: `https://inferencemax-app-git-feat-agentx-semianalysisai.vercel.app`
- **INVALIDATE_SECRET** lives in repo root `.env` under that key.
- **GitHub auth**: `gh auth token` for `gh` calls and the GITHUB_TOKEN env var.
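
The write-URL rule above can be checked mechanically before anything touches the database. The following is a minimal sketch, not part of the repo: `validate_write_url` is a hypothetical helper, and the host names in the usage note are made up. It only encodes the two conditions stated above (a `postgresql://` URL must be present, and the hostname must not contain `-pooler`).

```python
from urllib.parse import urlparse


def validate_write_url(url: str) -> str:
    """Return the URL unchanged if it satisfies the rules above, else raise ValueError.

    Hypothetical helper: it checks only what the runbook states
    (postgresql:// scheme, no "-pooler" in the hostname).
    """
    if not url:
        raise ValueError("no write URL provided: stop and ask the invoker")
    parsed = urlparse(url)
    if parsed.scheme != "postgresql":
        raise ValueError("expected a postgresql:// write URL")
    host = parsed.hostname or ""
    if not host:
        raise ValueError("write URL has no hostname")
    if "-pooler" in host:
        raise ValueError("pooled host: use the direct endpoint for ingest/migrations")
    return url
```

For example, a made-up direct host such as `ep-example-123.us-east-2.aws.neon.tech` passes, while `ep-example-123-pooler.us-east-2.aws.neon.tech` raises, which is the point where the agent should stop and ask.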

## Standard ingest

```bash
cd /Users/quilicic/InferenceX-app/packages/db
DATABASE_WRITE_URL='<provided direct non-pooled write URL>' \
GITHUB_TOKEN=$(gh auth token) \
pnpm exec tsx src/ingest-ci-run.ts --download <RUN_ID> SemiAnalysisAI/InferenceX
```

Then refresh the materialized view (the script's auto-refresh sometimes races):
`REFRESH MATERIALIZED VIEW latest_benchmarks;`

## Cache purge (always do after any DB mutation)

```bash
SECRET=$(grep "^INVALIDATE_SECRET" /Users/quilicic/InferenceX-app/.env | cut -d= -f2 | tr -d '"')
# Localhost (port 3002, NOT 3000)
curl -s -X POST -H "Authorization: Bearer $SECRET" http://localhost:3002/api/v1/invalidate
# Preview
mkdir -p /tmp/vp && cd /tmp/vp \
&& vercel link --project inferencemax-app --scope semianalysisai --yes >/dev/null 2>&1 \
&& vercel curl /api/v1/invalidate \
--deployment https://inferencemax-app-git-feat-agentx-semianalysisai.vercel.app \
--yes -- -sS -X POST -H "Authorization: Bearer $SECRET"
rm -rf /tmp/vp
```

## Delete + reingest (use only when user explicitly says "delete and reingest" OR when the run supersedes prior data with the same (model, hw, framework, precision))

```sql
BEGIN;
DELETE FROM benchmark_results br USING configs c
WHERE c.id = br.config_id
AND c.model = '<model>' AND c.hardware = '<hw>' AND c.framework = '<framework>'
AND c.precision = '<prec>' AND br.benchmark_type = '<bt>';
DELETE FROM availability
WHERE model = '<model>' AND hardware = '<hw>' AND framework = '<framework>'
AND precision = '<prec>' AND benchmark_type = '<bt>';
COMMIT;
```

If the user says "replace ONLY the points this run produces", scope the DELETE to `AND br.conc IN (...)` so untouched conc levels survive. Don't do this unless asked.

## AIPerf tagging — DO NOT use by default

AIPerf is no longer a separate harness from the user's perspective. **Always** ingest with `spec_method='none'` (the standard path above), regardless of run name. Run names that include the word "aiperf" do NOT mean you should set `spec_decoding='aiperf'` — the user wants those runs to merge into the standard legend entry alongside other runs of the same (model, hw, framework, precision).

Only override this if the user **explicitly** asks for the run to appear as a separate legend line. If they do, the patching procedure is preserved below. Otherwise, use the standard ingest section above and do not touch `spec_decoding`.

<details>
<summary>Explicit-request-only: how to tag a run as `spec_decoding='aiperf'`</summary>

```bash
RID=<run_id>
TMPDIR=$(mktemp -d -t aiperf-$RID-XXXX)
cd $TMPDIR

# 1. Logical-name dedup + download
gh api "repos/SemiAnalysisAI/InferenceX/actions/runs/$RID/artifacts" --paginate \
  --jq '.artifacts[] | "\(.name)\t\(.archive_download_url)\t\(.created_at)"' \
  | python3 -c "
import sys, re, collections
seen = collections.OrderedDict()
for line in sys.stdin:
    name, url, created = line.rstrip('\n').split('\t')
    key = re.sub(r'_[a-zA-Z][a-zA-Z0-9.-]*_\d+$', '', name)
    if key not in seen or seen[key][2] < created:
        seen[key] = (name, url, created)
for _, (name, url, _) in seen.items():
    print(f'{name}\t{url}')
" > artifacts.tsv
while IFS=$'\t' read -r name url; do
  mkdir -p "$name"
  gh api "$url" > "$name/a.zip" 2>/dev/null
  unzip -oq "$name/a.zip" -d "$name" 2>/dev/null
  rm "$name/a.zip"
done < artifacts.tsv

# 2. Patch every benchmark JSON to set spec_decoding=aiperf
find $TMPDIR -name "*.json" | python3 -c "
import sys, json
for fn in (l.strip() for l in sys.stdin):
    try:
        with open(fn) as f: d = json.load(f)
    except Exception: continue
    rows = d if isinstance(d, list) else [d]
    if not rows or not isinstance(rows[0], dict): continue
    changed = False
    for row in rows:
        if isinstance(row, dict) and ('scenario_type' in row or 'infmax_model_prefix' in row or 'tput_per_gpu' in row):
            row['spec_decoding'] = 'aiperf'
            changed = True
    if changed:
        with open(fn, 'w') as f: json.dump(d if isinstance(d, list) else rows[0], f)
"

# 3. Ingest in CI mode (reads INGEST_* env vars)
cd /Users/quilicic/InferenceX-app/packages/db
INGEST_RUN_ID=$RID INGEST_RUN_ATTEMPT=1 INGEST_ARTIFACTS_PATH=$TMPDIR INGEST_REPO=SemiAnalysisAI/InferenceX \
DATABASE_WRITE_URL='<provided direct non-pooled write URL>' \
GITHUB_TOKEN=$(gh auth token) \
pnpm exec tsx src/ingest-ci-run.ts
rm -rf $TMPDIR
```

The `spec_method` column has a lowercase check constraint — always lowercase.

</details>

## Don't auto-mention "AIPerf" in changelog entries

Changelog descriptions used to include "AIPerf harness" wording. Don't add this anymore — the user considers AIPerf the standard harness now. A run named "e2e Test - kimi aiperf w/ live assistant" should become a changelog entry like `B200 Kimi Ingest #N (live assistant)`, not `... (AIPerf harness, live assistant)`.

## Adding a perf changelog entry

Run AFTER ingest. The popover filters by `config_keys[].split('-')[1] === selected_precision` and drops entries with empty `config_keys`, so you MUST provide at least one config_key in the format `<model>-<precision>-<hw>-<framework>` (matches what the user actually sees in the filter chain).

```sql
INSERT INTO changelog_entries (workflow_run_id, date, base_ref, head_ref, config_keys, description, pr_link)
SELECT id, date, '', '', ARRAY['<model>-<precision>-<hw>-<framework>'], '<description>', NULL
FROM latest_workflow_runs WHERE github_run_id = <RUN_ID>
RETURNING id, workflow_run_id, date::text, description;
```
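
To see why an empty or malformed `config_keys` makes an entry vanish from the popover, the filter described above can be restated as a small Python sketch. This is an illustration only: the real filter lives in the frontend, `visible_in_popover` is a hypothetical name, the example keys and precision values are made up, and treating "any key matches" as sufficient is an assumption.

```python
def precision_of(config_key: str):
    """Second dash-separated field of a key shaped <model>-<precision>-<hw>-<framework>."""
    parts = config_key.split("-")
    return parts[1] if len(parts) > 1 else None


def visible_in_popover(config_keys, selected_precision: str) -> bool:
    """Assumed reading of the filter: entries with no keys are dropped,
    otherwise at least one key must carry the selected precision."""
    if not config_keys:
        return False
    return any(precision_of(key) == selected_precision for key in config_keys)


# Hypothetical keys, for illustration only.
print(visible_in_popover(["kimi-fp4-b200-vllm"], "fp4"))  # True
print(visible_in_popover([], "fp4"))                      # False
```

Under this reading, anything that shifts the second field (for example an extra hyphen inside the model segment) would make the entry fail the match, so it is worth checking the key against what the filter chain actually shows.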

Description convention from prior entries: `<HW upper> <Model> Ingest #<N> (<note>)` — e.g.

- `B200 Kimi Ingest #1`
- `MI355X Kimi Ingest #2`
- `H200 Kimi Ingest #1 (mmap cache)`

If the user doesn't specify a description, ask for one or derive it from the run name.

## Common gotchas

- **`conclusion IS NULL` filter**: availability hides runs whose `latest_workflow_runs.conclusion` is null (still in_progress). If a user wants in-progress data shown, you can `UPDATE workflow_runs SET conclusion='success', status='completed' WHERE id = <wr_id>` then `REFRESH MATERIALIZED VIEW latest_benchmarks`.
- **failed_run filter**: rows where `num_requests_successful === 0 AND num_requests_total > 0` get skipped on purpose — they have null metrics and would overwrite good rows via ON CONFLICT.
- **Aggregated `results_bmk` artifact** contains rows from all runner attempts merged together — pair the artifact-level logical-name dedup with the row-level failed-run skip to avoid empty-row overwrites.
- **Multi-attempt artifacts**: a single GitHub run can spill across runners (`h200-cw_00` + `h200-dgxc-slurm_1`); the logical-name dedup strips the `_<runner>_<attempt>` suffix.
- **Materialized view dedup tiebreaker**: `latest_benchmarks` picks rows by `date DESC, wr.run_started_at DESC`. Backfilling old data may not surface unless dates align with the user's date picker selection.
- **Date alignment for partial runs**: when a re-run only covers a subset of concs (`replace ONLY the points this run produces`), align its dates with the prior full sweep via `UPDATE benchmark_results SET date = '<full-sweep-date>' WHERE <rows from the partial run>` so the frontend's max-date-per-group dedup doesn't drop the older sweep.
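
The first three gotchas above combine into two small checks: collapse artifact names to a logical name, and skip rows from failed attempts. The sketch below shows both under stated assumptions. The regex is the one used in the tagging procedure earlier in this file, the field names come from the failed_run bullet, and the helper names and the `bmk_kimi_...` artifact names are hypothetical. The actual ingest script may implement these differently.

```python
import re

# Same pattern as the logical-name dedup above: strips a trailing _<runner>_<attempt>.
RUNNER_SUFFIX = re.compile(r"_[a-zA-Z][a-zA-Z0-9.-]*_\d+$")


def logical_name(artifact_name: str) -> str:
    """Artifact name with the runner/attempt suffix removed."""
    return RUNNER_SUFFIX.sub("", artifact_name)


def is_failed_run(row: dict) -> bool:
    """True when requests were attempted but none succeeded."""
    successful = row.get("num_requests_successful")
    total = row.get("num_requests_total") or 0
    return successful == 0 and total > 0


def rows_to_ingest(rows):
    """Drop failed-run rows so their null metrics cannot overwrite good rows."""
    return [row for row in rows if not is_failed_run(row)]


# Hypothetical artifact names using the runner suffixes mentioned above.
print(logical_name("bmk_kimi_h200-cw_00"))         # bmk_kimi
print(logical_name("bmk_kimi_h200-dgxc-slurm_1"))  # bmk_kimi
```

Both names collapse to the same key, so only the newest artifact per key would be kept, and the row-level check then guards the aggregated `results_bmk` artifact, where rows from every attempt are merged.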

## Process

1. **Always start by checking the run** with `gh api repos/SemiAnalysisAI/InferenceX/actions/runs/<RID> --jq '{name, status, conclusion}'`. Note the model/hw/precision from the name. If `status != "completed"`, ask the user if they want to ingest in-progress data (will likely have failed_run skips).
2. **Check the DB** for any pre-existing rows for this run or the same (model, hw, framework, precision) combo if the user mentioned superseding.
3. **Ingest** via the standard path. Do NOT use AIPerf tagging unless the user explicitly asks for a separate legend line.
4. **Refresh materialized view**.
5. **Add changelog entry** if the user asked or if the run is a "marker" worth surfacing.
6. **Purge both caches** (localhost 3002 + preview).
7. **Report** the row count, date, hardware, run id, and changelog id (if added).

## Related: ingesting agentic _datasets_ (not benchmark runs)

This agent ingests **benchmark runs**. The HF agentic trace **datasets** (`semianalysisai/cc-traces-weka-*`) that the agentic benchmark replays are ingested by a separate script, not this flow:

```bash
cd packages/db && DATABASE_WRITE_URL='<direct write url>' \
pnpm exec tsx src/ingest-weka-dataset.ts <hf-dataset-id> \
[--label "…"] [--variant full|256k] [--description "…"] [--limit N]
```

It populates the `datasets` + `dataset_conversations` tables (migration `007_agentic.sql`) that back the `/datasets` pages — upsert/replace per dataset, then purge the API cache like any other ingest. Same write-URL rule applies (direct, non-pooled, provided by the invoker).

## Don't

- Don't push to git unless the user asked.
- Don't ingest without permission if it's a delete+reingest of existing data.
- Don't hit port 3000 for cache purge — it's a different project.
- Don't capitalize `spec_method` values (DB has a lowercase check constraint).
3 changes: 3 additions & 0 deletions .eslintignore
@@ -0,0 +1,3 @@
# Stale agent worktrees produced by parallel Claude Code sessions — they
# hold their own branches and are linted as part of their own runs.
.claude/worktrees/
1 change: 1 addition & 0 deletions .oxlintrc.json
@@ -28,6 +28,7 @@
"no-undef": "off",
"no-underscore-dangle": "off",
"no-useless-undefined": "off",
"require-unicode-regexp": "off",
"no-warning-comments": "off",
"prefer-destructuring": "off",
"sort-imports": "off",
93 changes: 93 additions & 0 deletions packages/app/cypress/component/dataset-list.cy.tsx
@@ -0,0 +1,93 @@
import { QueryClient, QueryClientProvider } from '@tanstack/react-query';
import { AppRouterContext } from 'next/dist/shared/lib/app-router-context.shared-runtime';

import { DatasetList } from '@/components/datasets/dataset-list';
import type { DatasetRecord } from '@/hooks/api/use-datasets';

const datasets: DatasetRecord[] = [
  {
    id: 'ds-1',
    slug: 'cc-traces-weka-full',
    label: 'cc-traces-weka (full)',
    variant: 'full',
    description: 'Every captured request, unmodified.',
    hf_url: 'https://huggingface.co/datasets/semianalysisai/cc-traces-weka-full',
    license: 'apache-2.0',
    conversation_count: 1234,
    summary: {
      totalIn: 5_000_000,
      totalOut: 250_000,
      cachedPct: 0.82,
      mainTurns: 9800,
      subagentGroups: 540,
    },
    ingested_at: '2026-06-20T00:00:00Z',
  },
  {
    id: 'ds-2',
    slug: 'cc-traces-weka-256k',
    label: 'cc-traces-weka (256k)',
    variant: '256k',
    description: 'Turns trimmed to a 256k context window.',
    hf_url: null,
    license: 'apache-2.0',
    conversation_count: 980,
    summary: {
      totalIn: 3_200_000,
      totalOut: 180_000,
      cachedPct: 0.79,
      mainTurns: 7600,
      subagentGroups: 410,
    },
    ingested_at: '2026-06-19T00:00:00Z',
  },
];

function createMockRouter() {
  return {
    push: cy.stub(),
    replace: cy.stub(),
    refresh: cy.stub(),
    back: cy.stub(),
    forward: cy.stub(),
    prefetch: cy.stub().resolves(),
  };
}

function mountList() {
  const queryClient = new QueryClient({ defaultOptions: { queries: { retry: false } } });
  cy.mount(
    <AppRouterContext.Provider value={createMockRouter()}>
      <QueryClientProvider client={queryClient}>
        <DatasetList />
      </QueryClientProvider>
    </AppRouterContext.Provider>,
  );
}

describe('DatasetList', () => {
  it('renders a card per dataset with its summary stats', () => {
    cy.intercept('GET', '/api/v1/datasets', { statusCode: 200, body: datasets }).as('list');
    mountList();
    cy.wait('@list');
    cy.contains('cc-traces-weka (full)').should('be.visible');
    cy.contains('cc-traces-weka (256k)').should('be.visible');
    cy.contains('1,234').should('be.visible'); // conversation_count, localized
    cy.contains('82%').should('be.visible'); // cachedPct
    cy.get('a[href="/datasets/cc-traces-weka-full"]').should('exist');
  });

  it('shows the empty state when no datasets are ingested', () => {
    cy.intercept('GET', '/api/v1/datasets', { statusCode: 200, body: [] }).as('empty');
    mountList();
    cy.wait('@empty');
    cy.contains('No datasets ingested yet.').should('be.visible');
  });

  it('shows the error state when the request fails', () => {
    cy.intercept('GET', '/api/v1/datasets', { statusCode: 500, body: { error: 'boom' } }).as('err');
    mountList();
    cy.wait('@err');
    cy.contains('Failed to load datasets.').should('be.visible');
  });
});
82 changes: 82 additions & 0 deletions packages/app/cypress/component/distribution-card.cy.tsx
@@ -0,0 +1,82 @@
import { DistributionCard } from '@/components/datasets/distribution-card';
import type { Distribution } from '@/hooks/api/use-datasets';

const distribution: Distribution = {
  bins: [
    { x0: 0, x1: 100, count: 5 },
    { x0: 100, x1: 200, count: 20 },
    { x0: 200, x1: 300, count: 12 },
    { x0: 300, x1: 400, count: 3 },
  ],
  stats: {
    count: 40,
    min: 10,
    max: 390,
    mean: 180,
    median: 175,
    p75: 250,
    p90: 320,
    p95: 360,
  },
};

describe('DistributionCard', () => {
  it('renders the title, summary stats, and one bar per bin', () => {
    cy.mount(
      <DistributionCard title="Input tokens per turn" unit="tok" distribution={distribution} />,
    );
    cy.contains('Input tokens per turn').should('be.visible');
    cy.contains('n=40').should('be.visible');
    cy.contains('p50 175').should('be.visible');
    cy.contains('p75 250').should('be.visible');
    cy.contains('p90 320').should('be.visible');
    cy.contains('p95 360').should('be.visible');
    cy.get(
      'line[stroke="#3b82f6"], line[stroke="#22c55e"], line[stroke="#f59e0b"], line[stroke="#ef4444"]',
    ).should('have.length', 8);
    // One filled bar rect per bin (ChartHover may add a transparent overlay rect).
    cy.get('rect[class*="fill-primary"]').should('have.length', distribution.bins.length);
  });

  it('shows a "No data" placeholder when no distribution is provided', () => {
    cy.mount(<DistributionCard title="Empty metric" unit="tok" />);
    cy.contains('Empty metric').should('be.visible');
    cy.contains('No data').should('be.visible');
    cy.get('rect[class*="fill-primary"]').should('not.exist');
  });

  it('marks the chart as log scale when scale="log"', () => {
    cy.mount(
      <DistributionCard
        title="Output tokens per turn"
        unit="tok"
        scale="log"
        distribution={distribution}
      />,
    );
    cy.contains('log scale').should('be.visible');
  });

  it('renders older v1 stats without unavailable percentile guides', () => {
    cy.mount(
      <DistributionCard
        title="Legacy metric"
        unit="tok"
        distribution={{
          bins: distribution.bins,
          stats: {
            count: 40,
            min: 10,
            max: 390,
            mean: 180,
            median: 175,
            p90: 320,
          },
        }}
      />,
    );
    cy.contains('p50 175').should('be.visible');
    cy.contains('p90 320').should('be.visible');
    cy.contains('NaN').should('not.exist');
  });
});
@@ -14,8 +14,8 @@ describe('Inference ChartControls', () => {

   it('renders the sequence selector with the current sequence', () => {
     // Default mock: selectedSequence = Sequence.EightK_OneK -> label "8K / 1K"
-    cy.get('#sequence-select').should('be.visible');
-    cy.get('#sequence-select').should('contain.text', '8K / 1K');
+    cy.get('#scenario-select').should('be.visible');
+    cy.get('#scenario-select').should('contain.text', '8K / 1K');
   });

   it('renders the precision multi-select with the current precision', () => {