[WIP] agentx #348

Draft
cquil11 wants to merge 92 commits into master from feat/agentx

cquil11 (Contributor) commented May 14, 2026

No description provided.

cquil11 and others added 12 commits April 23, 2026 13:40
Adds agentic_traces scenario end-to-end:
- Schema migrations for agentic scenario, availability, and KV offload mode
- DB ingest/ETL + query updates to carry scenario, offload_mode, and
  server/theoretical cache-hit rates through to the API layer
- Frontend types, filters (GlobalFilterContext / InferenceContext /
  ChartControls), URL state, and tooltip rows for agentic-only fields
- ScatterGraph: subtle dashed halo on Pareto-frontier points that used
  KV offload so the tradeoff is visible at a glance
- ScatterGraph: include `offload_mode` in `buildPointConfigId` so d3's data
  join keeps both `on` and `off` variants for the same (config, conc).
  Without it, the second variant collapsed onto the first key, so FP8
  offload-on points (and their halos) silently disappeared.
- benchmark-mapper: handle older artifacts that emit `users`/`offload_mode`
  AND newer ones that emit `conc`/`offloading` (with 'none' → 'off' mapping).
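
The key collision described in the `buildPointConfigId` item can be sketched as follows. This is a minimal illustration, not the repository's implementation: the `PointLike` shape, the key format, and the sample values are assumptions.

```typescript
// Hypothetical point shape; only `conc` and `offload_mode` come from the commit text.
interface PointLike {
  config: string;
  conc: number;
  offload_mode: "on" | "off";
}

// Illustrative key: every field that distinguishes two rendered points must be
// part of the key, or a keyed data join treats both points as the same datum.
function buildPointConfigId(p: PointLike): string {
  return `${p.config}|${p.conc}|${p.offload_mode}`;
}

// Key without offload_mode, shown only to reproduce the collision.
function keyWithoutOffload(p: PointLike): string {
  return `${p.config}|${p.conc}`;
}

const samplePoints: PointLike[] = [
  { config: "fp8-tp8", conc: 16, offload_mode: "off" },
  { config: "fp8-tp8", conc: 16, offload_mode: "on" },
];

// One key for two points: the second variant would collapse onto the first.
const collidingKeys = new Set(samplePoints.map(keyWithoutOffload));
// Two keys for two points: both variants survive the join.
const distinctKeys = new Set(samplePoints.map(buildPointConfigId));
console.log(collidingKeys.size, distinctKeys.size);
```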

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The halo's purpose is to surface KV-offload usage; restricting it to
Pareto-frontier-only points hid the indicator on most runs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
b300-p1 (and similar) artifacts were skipping ingest because the runner-pool
suffix wasn't in the strip list and didn't normalize to the canonical b300
GPU key.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Label text now includes `C=<conc>` alongside the GPU/parallelism tag
  (default `<tp> C=<conc>`, advanced `<getPointLabel> C=<conc>`)
- Bumped point-label font-weight to 700 so the labels read clearly against
  the chart fill
- Greedy collision-avoidance pass on render and zoom: tries placing each
  label above/below the point through 4 candidate dy offsets, hiding the
  label only when no slot is free
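
A greedy pass of this kind might look like the sketch below. It assumes axis-aligned label boxes centered horizontally on the point; the `Box`/`LabelInput` types, the 4px gap, and the candidate order are illustrative assumptions, not the chart's actual values.

```typescript
interface Box { x: number; y: number; w: number; h: number; }
// px/py: point position; w/h: measured label size (hypothetical inputs).
interface LabelInput { px: number; py: number; w: number; h: number; }

// Strict inequality: boxes that only touch along an edge do not count as overlapping.
function overlaps(a: Box, b: Box): boolean {
  return a.x < b.x + b.w && b.x < a.x + a.w && a.y < b.y + b.h && b.y < a.y + a.h;
}

// Four candidate offsets for the label's top edge, in preference order:
// just above, just below, one slot further above, one slot further below.
function candidateOffsets(h: number, gap: number): number[] {
  return [-(gap + h), gap, -(gap + 2 * h), gap + h];
}

// Returns the chosen dy per label, or null when no candidate is free (label hidden).
function placeLabels(labels: LabelInput[], gap = 4): (number | null)[] {
  const placed: Box[] = [];
  return labels.map((l) => {
    for (const dy of candidateOffsets(l.h, gap)) {
      const box: Box = { x: l.px - l.w / 2, y: l.py + dy, w: l.w, h: l.h };
      if (!placed.some((p) => overlaps(p, box))) {
        placed.push(box); // greedy: earlier labels keep their slot
        return dy;
      }
    }
    return null;
  });
}
```

Because the pass is greedy, the result depends on label order; a fuller version would also test candidates against the point markers themselves.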

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…oint

Tspans now ride above the text's `dy` anchor — the LAST line sits at the
anchor (just above the point) and earlier lines stack above it. Previously
the second tspan landed below the anchor and crashed into the marker.

Also widened collision candidates by label height so the flipped-below
position fully clears the point on multi-line labels.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… pass

When a `<text>` contains tspans, the parent's `dy` does not shift the bbox
cleanly — its (unused) y=0 origin still factors in, so the rendered text
ended up centered on the point. Move the absolute offset into the FIRST
tspan's `dy`; later tspans cascade by 1.1em.

Collision avoidance now drives the first tspan's `dy` and tries four
candidate baselines (primary above, primary below, secondary above,
secondary below), accounting for full label height when picking a non-
overlapping slot. Labels still hidden as a last resort.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two complementary fixes for runs whose `results_bmk` aggregated artifact
ends up containing both a successful row and a failed-attempt row for the
same (config, conc, offload) — the failed row's null metrics were
overwriting the good row via ON CONFLICT DO UPDATE.

1. Artifact-level: strip the trailing `_<runner-pool>_<attempt>` suffix
   from each artifact name and group by the logical name, keeping only the
   most recent per group.

2. Row-level: skip rows with `num_requests_successful === 0` AND
   `num_requests_total > 0`. The aggregated artifact merges rows from all
   runners — including failed ones — so artifact-level dedup alone can't
   reach inside it.
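
The two filters could be sketched like this. The suffix regex, the `Artifact` shape, and the `createdAt` field are assumptions for illustration; only the two `num_requests_*` field names and the skip condition come from the description above.

```typescript
// Hypothetical artifact metadata; createdAt is any monotonically increasing timestamp.
interface Artifact { name: string; createdAt: number; }
interface ResultRow { num_requests_successful: number; num_requests_total: number; }

// Assumed suffix shape "_<runner-pool>_<attempt>", e.g. "_b300-p1_2".
const RUNNER_ATTEMPT_SUFFIX = /_[A-Za-z0-9-]+_\d+$/;

// Artifact-level: group by logical name and keep the most recent artifact per group.
function latestPerLogicalName(artifacts: Artifact[]): Artifact[] {
  const latest = new Map<string, Artifact>();
  for (const a of artifacts) {
    const logical = a.name.replace(RUNNER_ATTEMPT_SUFFIX, "");
    const seen = latest.get(logical);
    if (!seen || a.createdAt > seen.createdAt) latest.set(logical, a);
  }
  return Array.from(latest.values());
}

// Row-level: a row that attempted requests but completed none is a failed attempt.
function isFailedAttempt(r: ResultRow): boolean {
  return r.num_requests_successful === 0 && r.num_requests_total > 0;
}

// Rows to ingest from an aggregated artifact.
function ingestableRows(rows: ResultRow[]): ResultRow[] {
  return rows.filter((r) => !isFailedAttempt(r));
}
```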

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
# Conflicts:
#	packages/app/src/components/GlobalFilterContext.tsx
#	packages/app/src/components/inference/utils/tooltipUtils.ts
#	packages/db/src/etl/normalizers.ts
Tag display name for the `aiperf` spec_method suffix used by the
alternate-harness runs ingested for the agentic minimax sweep.
Without this entry the legend shows 'AIPERF' from the default
toUpperCase fallback.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
bigint workflow_run_id sometimes deserializes as a number on the
frontend depending on the postgres adapter's behavior; strict ===
between a number and a string silently dropped every match, so the
changelog popover always reported "no changelog data available."
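
One way to make the comparison robust is to normalize both sides to strings before comparing. The `sameRunId` helper below is a hypothetical name, not the function used in the fix.

```typescript
// An id may arrive as a string, a number, or a bigint depending on the driver.
type RunId = string | number | bigint;

// Compare by canonical string form so 27915787191 and "27915787191" match.
// Caveat: a value that was already parsed into a JS number above 2^53 - 1 has
// lost precision before it reaches this function.
function sameRunId(a: RunId, b: RunId): boolean {
  return String(a) === String(b);
}

console.log(sameRunId(27915787191, "27915787191"));
```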

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
If the selected model has agentic_traces data, prefer that over the
default 8K/1K fixed-seq when the user hasn't explicitly chosen via URL.
effectiveSequence already falls back to availableSequences[0] for models
without agentic, so models with only fixed-seq data still render correctly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
vercel Bot commented May 14, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

Project: inferencemax-app
Deployment: Ready
Actions: Preview, Comment
Updated (UTC): Jun 23, 2026 3:35am

# Conflicts:
#	packages/app/src/components/inference/ui/ChartControls.tsx
#	packages/app/src/components/inference/utils/tooltipUtils.ts
#	packages/db/src/etl/normalizers.ts
rowToAggDataEntry was only copying median/p99 metric variants — picking
p90/p99.9 in the percentile selector silently fell back to 0 and
collapsed every point into a vertical line at x=0. Copy the full
median/p90/p99/p99.9 set into AggDataEntry.

Hide the X-Axis Metric dropdown for agentic mode (it doubled up with the
percentile selector) and route the input-metric chart through
withPercentile so picking p99 actually plots p99_ttft instead of the
hard-coded config default. Percentile options pared back to median + p99.
cquil11 added 2 commits May 15, 2026 12:30
# Conflicts:
#	packages/app/src/components/GlobalFilterContext.tsx
#	packages/app/src/components/inference/InferenceContext.tsx
#	packages/app/src/components/inference/hooks/useChartData.ts
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Aligns the TTFT x-axis selectors with the percentile selector — only
p90 is offered everywhere. Default x-axis metric and chart config
input-throughput x are p90_ttft.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The `!isAgentic` gate on the e2e TTFT override branch dropped the
user's `p90_ttft` pick in agentic mode, leaving the chart on the
default p90_e2el. The trailing withPercentile pass is idempotent
when xAxisField is already at the right percentile, so the gate is
unnecessary.
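
The idempotence argument can be illustrated with a sketch of a prefix-rewriting helper. The body below is an assumption about how such a `withPercentile` could work (field names of the form `<percentile>_<metric>`); it is not the repository's function.

```typescript
// Percentile prefixes assumed from the commit messages (median/p90/p99/p99.9).
type Percentile = "median" | "p90" | "p99" | "p99.9";

// Matches a leading percentile prefix such as "p90_" or "p99.9_".
const PERCENTILE_PREFIX = /^(?:median|p99\.9|p99|p90)_/;

// Swap the percentile prefix of a metric field; fields without one pass through.
function withPercentile(field: string, pct: Percentile): string {
  if (!PERCENTILE_PREFIX.test(field)) return field;
  return field.replace(PERCENTILE_PREFIX, pct + "_");
}

// Applying the pass to a field already at the target percentile changes nothing,
// so running it after a user override cannot clobber the override.
console.log(withPercentile("p90_ttft", "p90"));
```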

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…indow

Two fixes to the conversation/request-timeline view:

1. The Profiling vs 'All (incl. warmup)' toggle never did anything —
   aiperf's profile_export only contains profiling-phase requests, so
   every stored record has phase='profiling' (verified: 297k/297k rows).
   Hide the toggle unless a non-profiling request actually exists, so it
   reappears and works only if warmup is ever exported.

2. The timeline grew to fit every conversation/worker, making the card
   arbitrarily tall. Cap the body at a fixed height (480px) and scroll
   the rows vertically inside it. Few-row runs still size to content
   (no empty space); the label column and bars scroll together since
   they share the one scroll container.

Verified live on a 3475-request point: phase toggle absent, row-mode
toggle still present, window clientHeight 480 with ~3745px scrolling
inside.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…t zoom

The fixed-height window put the chart's horizontal scrollbar at the
bottom of the tall (full-height) content, below the fold and unreachable.
Make the window itself the single scroll container (overflow-auto, both
axes) and pin the label column with position:sticky left-0, so the
horizontal scrollbar stays at the window's bottom edge while the label
column stays put during horizontal scroll and scrolls with the rows
vertically.

Also add double-click anywhere on the timeline to reset zoom/pan (same
resetZoom the existing button calls) and note it in the hint text.

Verified live: window scrollW 1280 > clientW 879 (h-scroll present and
working), label column sticky, rows scroll vertically.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The dashed offload-mode ring (drawn in ScatterGraph's onRender for every
point with offload_mode='on') was missing from GPU compare mode
(GPUGraph), so the CPU-offloading indicator never appeared there. Mirror
it in GPUGraph's onRender — same dashed var(--foreground) ring at
POINT_SIZE+4, appended inside each .dot-group so it travels with the
point on zoom/pan.

Verified live in compare mode (DSv4 B200/B300 agentic): offload points
now render the dashed halo (5 rings, r=7.5, dash 3 2).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
generateHighContrastColors clamps each vendor's series into its brand hue
zone (NVIDIA=green, AMD=red) at <=PREFERRED_MAX items. The point of that
clamp is to keep DIFFERENT vendors apart at a glance — but when only one
vendor is present (the common all-NVIDIA agentic comparison: B200/B300 x
vLLM/SGLang), there's no rival to separate from, so every series collapses
into the same narrow green band and high-contrast mode looks like it does
nothing.

When a single vendor is present, skip the brand zone and rival-ban and use
the full hue wheel for maximum separation. Verified on an all-NVIDIA
agentic view: HC now spreads pink/blue/gold/green (hues 45/99/227/330,
min adjacent gap 54deg) instead of four near-identical greens. Multi-vendor
behavior is unchanged — vendors keep their brand zones so they stay
distinguishable. The non-HC palette still carries vendor identity.

Updated the single-vendor color tests to assert separability across the
full wheel rather than brand-zone confinement.
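
The separability figure quoted above can be checked with a small helper that measures the smallest circular gap between adjacent hues. This is a sketch of the measurement only; `minAdjacentHueGap` and `evenHues` are illustrative names, and the repository's palette generator is not reproduced here.

```typescript
// Smallest gap, in degrees, between neighbouring hues on the 360-degree wheel.
function minAdjacentHueGap(hues: number[]): number {
  if (hues.length < 2) return 360; // a single hue has no neighbour to collide with
  const sorted = [...hues].sort((a, b) => a - b);
  let min = 360;
  for (let i = 0; i < sorted.length; i++) {
    const next = sorted[(i + 1) % sorted.length]; // wraps from last back to first
    const gap = (next - sorted[i] + 360) % 360;
    min = Math.min(min, gap);
  }
  return min;
}

// Upper bound for comparison: n hues spaced evenly around the wheel.
function evenHues(n: number): number[] {
  return Array.from({ length: n }, (_, i) => (i * 360) / n);
}

console.log(minAdjacentHueGap([45, 99, 227, 330])); // hues reported in the commit message
console.log(minAdjacentHueGap(evenHues(4)));
```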

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Conflict resolutions:
- ScatterGraph: adopt master's interactionRef refactor (visibility/color/
  shape read via interactionRef.current so toggles restyle without a
  rebuild); keep agentx's Sequence import, pointMatchesIssue from master,
  per-point 'C=<conc>' labels, and the traceAvailability tooltip dep
  (selectedPrecisions now flows through interactionRef, not the deps array).
- GPUGraph: master's !showPointLabels (state renamed from hidePointLabels)
  with agentx's 'C=<conc>' label text.
- ChartDisplay: keep agentx's view-mode toggle + MP4/replay export +
  x-axis-mode e2e heading (session-time / prefill-tps); the resolved
  onExportCsv + caption already carry master's MetricAssumptionNotes,
  UnofficialDomainNotice, richer precision/sequence subtitle, and
  knownIssueCsvNote CSV notes. Dropped master's now-superseded
  E2eXAxisDropdown variant.
- chart-legend: keep both LegendSwitchConfig additions (infoTooltip +
  advanced).
- GlobalFilterContext: master's new precisionCurveCounts used the removed
  islOslToSequence; switch to agentx's rowToSequence (handles agentic
  rows' null isl/osl), matching the other call sites.
- db/package.json: master's @types/node 25.9.3 + coverage-v8 4.1.9, keep
  agentx's @types/stream-json; pnpm-lock regenerated.
- decoration.test: stub useTraceAvailability so master's decoration test
  doesn't need a QueryClientProvider for agentx's tooltip query.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ontrast on

Change the inference chart's default toggle states:
- Line Labels: on -> off  (i_linelabel=1 overrides on)
- Parallelism Labels: off -> on, which also defaults point labels on since
  parallelism labels ARE point labels (i_advlabel=0 overrides off)
- High Contrast: off -> on, via a new opt-in defaultHighContrast on
  useChartUIState so reliability/evaluation (r_/e_ prefixes) stay off;
  i_hc=0 overrides off. Historical trends shares the inference context so
  it inherits the high-contrast default too.

URL serialization flipped to omit each param at its new default and only
write the override value, so share links stay clean. Updated line-labels,
gradient-labels, and url-params E2E specs to the new defaults.
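
The omit-at-default rule can be sketched as below. The parameter names and defaults are taken from the list above; the `ToggleState` shape and both function names are assumptions made for the example.

```typescript
interface ToggleState { lineLabels: boolean; parallelismLabels: boolean; highContrast: boolean; }

// New defaults described in this commit.
const DEFAULTS: ToggleState = { lineLabels: false, parallelismLabels: true, highContrast: true };

const PARAM: Record<keyof ToggleState, string> = {
  lineLabels: "i_linelabel",
  parallelismLabels: "i_advlabel",
  highContrast: "i_hc",
};

// Write a param only when the value differs from its default, so share links stay short.
function serializeToggles(state: ToggleState): string {
  const params = new URLSearchParams();
  (Object.keys(PARAM) as (keyof ToggleState)[]).forEach((k) => {
    if (state[k] !== DEFAULTS[k]) params.set(PARAM[k], state[k] ? "1" : "0");
  });
  return params.toString();
}

// A missing param means "use the default"; a present one is an explicit override.
function parseToggles(query: string): ToggleState {
  const params = new URLSearchParams(query);
  const read = (k: keyof ToggleState): boolean => {
    const v = params.get(PARAM[k]);
    return v === null ? DEFAULTS[k] : v === "1";
  };
  return {
    lineLabels: read("lineLabels"),
    parallelismLabels: read("parallelismLabels"),
    highContrast: read("highContrast"),
  };
}
```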

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…bling chips

The agentic detail page's sibling navigator labeled configs with an ad-hoc
`TP{n}EP{n}` / `{p}P+{d}D` scheme that ignored dp-attention and the
TEP/DEP collapse, so a DEP4 config read as plain TP4EP4 (and, mid-deploy
before the API carried dp_attention, as TEP4).

Extract the scatter chart's labeler into a shared parallelism-label module
(configSegmentLabel + parallelismLabel) and route both getPointLabel and the
sibling chipLabel through it, so the two surfaces describe a config
identically (TP/EP/TEP/DEP/DPA…, multinode-disagg worker segments).

Carry the fields the labeler needs through the siblings query/API/hook:
decode/prefill dp_attention + num_workers + is_multinode.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add a 'Sort by' dropdown to the agentic detail page's point navigator:
- Default (DB order)
- Concurrency ↑
- Parallelism (groups all TP, then TEP/DEP/EP… by ep→tp→dpa, conc within)
- Throughput/GPU ↓
- Total requests ↓

Carry tput_per_gpu and total_requests (total_requests_completed, falling
back to legacy num_requests_total) through the siblings query/API/hook.

prev/next follow the sorted order, and the chosen sort is persisted in the
URL (?sort=) — read on mount and threaded through every point link plus a
router.replace — so navigating to another point no longer resets it.
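
A sketch of three of the sort modes follows, including the request-count fallback. The `Sibling` shape and the `SortMode` values are hypothetical (the actual `?sort=` values are not given above), and the Parallelism mode is omitted because its grouping rules are not fully specified here.

```typescript
interface Sibling {
  id: number;
  conc: number;
  tput_per_gpu: number | null;
  total_requests_completed?: number | null;
  num_requests_total?: number | null; // legacy field
}

// Hypothetical mode names for illustration.
type SortMode = "default" | "conc" | "tput" | "requests";

// Prefer the new field, fall back to the legacy one, then to 0.
function totalRequests(s: Sibling): number {
  return s.total_requests_completed ?? s.num_requests_total ?? 0;
}

// Returns a sorted copy; "default" preserves the incoming (DB) order.
function sortSiblings(rows: Sibling[], mode: SortMode): Sibling[] {
  const copy = rows.slice();
  switch (mode) {
    case "conc":
      return copy.sort((a, b) => a.conc - b.conc); // ascending
    case "tput":
      return copy.sort((a, b) => (b.tput_per_gpu ?? 0) - (a.tput_per_gpu ?? 0)); // descending
    case "requests":
      return copy.sort((a, b) => totalRequests(b) - totalRequests(a)); // descending
    default:
      return copy;
  }
}
```

Deriving prev/next from the sorted copy rather than from the raw rows is what keeps navigation consistent with the chosen order.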

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
cquil11 and others added 13 commits June 22, 2026 15:49
Additive migration backing the new /datasets area: a registry of ingested
HF cc-traces-weka dataset versions (summary + precomputed chart_data) and one
row per conversation holding a flamegraph-ready structure JSONB. Drop snippet
in the migration header for revert.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Pure transforms (no DB) turning a raw cc-traces-weka conversation into a
flamegraph-ready structure: ordered turn/subagent nodes with input split into
cached-prefix vs uncached-suffix. Ports _count_seen_prefix_blocks from the
aiperf weka loader; subagents run against a spawn-time snapshot of the parent
prefix cache. Includes linear/log histogram helpers for the detail cards and
13 unit tests.
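
The cached-prefix split might be sketched as follows. This assumes the semantics suggested by the name `_count_seen_prefix_blocks`: count the leading block hashes already seen, then record all of the turn's blocks. The TypeScript names and the numeric hash type are assumptions, not the ported code.

```typescript
// Number of leading blocks whose hash is already in the seen set.
function countSeenPrefixBlocks(hashIds: number[], seen: Set<number>): number {
  let n = 0;
  while (n < hashIds.length && seen.has(hashIds[n])) n++;
  return n;
}

interface TurnSplit { cachedBlocks: number; uncachedBlocks: number; }

// Walk turns in order, splitting each into cached prefix vs uncached suffix,
// then add the turn's blocks to the cache. Mutates `seen`.
function splitTurns(turns: number[][], seen: Set<number>): TurnSplit[] {
  return turns.map((hashIds) => {
    const cached = countSeenPrefixBlocks(hashIds, seen);
    hashIds.forEach((h) => seen.add(h));
    return { cachedBlocks: cached, uncachedBlocks: hashIds.length - cached };
  });
}

// A subagent starts from a copy of the parent's cache taken at spawn time,
// so its own blocks never leak back into the parent.
function splitSubagent(turns: number[][], parentSeen: Set<number>): TurnSplit[] {
  return splitTurns(turns, new Set(parentSeen));
}
```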

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Pages the HF datasets-server rows API (adaptive page length for the ~3.5MB
rows), builds the flamegraph structure + cached-prefix split per conversation,
accumulates dataset-level distributions (input/output length, turns/conv,
subagent fan-out, cached fraction) into datasets.chart_data, and upserts
datasets + dataset_conversations. DATABASE_WRITE_URL must be provided. Verified
the cached split against a hand computation on raw hash_ids.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Retry 429/5xx with exponential backoff (honoring Retry-After) instead of
shrinking page size, plus a 400ms inter-page delay. Lets the full 393-row
ingest complete without tripping the datasets-server rate limit.
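
The retry timing can be sketched as a pure function. The 400ms inter-page delay comes from the message above; the 1s base, the 60s cap, and the function names are assumptions. Only the delta-seconds form of `Retry-After` is handled; an HTTP-date value falls through to exponential backoff.

```typescript
// Delay between successful pages, as stated in the commit message.
const INTER_PAGE_DELAY_MS = 400;

// Retry on rate limiting and server errors only.
function shouldRetry(status: number): boolean {
  return status === 429 || (status >= 500 && status <= 599);
}

// Delay before retry number `attempt` (0-based). A numeric Retry-After header
// wins; otherwise use base * 2^attempt, capped. Base and cap are assumed values.
function retryDelayMs(
  attempt: number,
  retryAfter: string | null,
  baseMs = 1000,
  capMs = 60000,
): number {
  if (retryAfter !== null && retryAfter.trim() !== "") {
    const secs = Number(retryAfter);
    if (isFinite(secs) && secs >= 0) return secs * 1000;
  }
  return Math.min(capMs, baseMs * Math.pow(2, attempt));
}

console.log(retryDelayMs(3, null), INTER_PAGE_DELAY_MS);
```

In a fetch loop, the caller would sleep for `retryDelayMs(attempt, response.headers.get("Retry-After"))` whenever `shouldRetry(response.status)` is true.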

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
queries/datasets.ts: listDatasets, getDataset (incl chart_data),
listConversations (paginated, searchable, 4 sort modes — separate per-sort
queries since the neon HTTP driver can't compose order-by fragments),
getConversation (flamegraph structure). Routes under /api/v1/datasets/* with
cachedQuery + gzip cachedJson. Hooks use-datasets.ts mirror the existing
benchmark-siblings hook style. Verified all four routes against the live branch.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- /datasets: methodology prose + dataset registry cards (DatasetList)
- /datasets/[slug]: summary stats, model mix, 5 precomputed-histogram
  distribution cards (DistributionCard, log/linear), and a
  searchable/sortable/paginated conversation table
- /datasets/[slug]/conversations/[convId]: per-conversation TraceFlamegraph —
  one bar per turn (cached prefix + uncached input + output), subagent groups
  collapsible (collapsed by default) with expand/collapse-all
- header nav 'Datasets' link
- query-layer test (mock DbClient): not-found paths + numeric coercion

Verified end-to-end against the live branch DB: both datasets list with real
stats, distributions render, flamegraph shows the prefix-reuse signature
(turn 1 fully uncached, later turns mostly cached), expand-all surfaces
subagent subturns. Zero console errors.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Wrap rows in a fixed-height (max-h-[520px]) vertically scrollable bordered box.
Subagent group headers carry aggregate token totals that dwarf any single turn,
which made their bars overflow the row (width >> 100%). Now turns/subturns use a
per-turn scale while group headers use a separate group-aggregate scale (slim
muted strips), both clamped to the track — groups stay comparable to each other
and nothing overflows.
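
The two-scale idea reduces to one clamped width function called with different maxima. The helper name and the token counts below are made up for illustration.

```typescript
// Bar width as a percentage of the track, clamped to [0, 100].
function barWidthPct(value: number, scaleMax: number): number {
  if (scaleMax <= 0) return 0;
  return Math.max(0, Math.min(100, (value / scaleMax) * 100));
}

// Hypothetical token counts: single turns vs subagent group aggregates.
const maxTurnTokens = 50000;
const maxGroupTokens = 800000;

// A turn on the per-turn scale.
console.log(barWidthPct(25000, maxTurnTokens));
// A group total on the per-turn scale would be 800% unclamped; the clamp caps it at 100.
console.log(barWidthPct(400000, maxTurnTokens));
// The same group on the group-aggregate scale stays comparable to other groups.
console.log(barWidthPct(400000, maxGroupTokens));
```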

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add run_datasets (workflow_run → dataset slug) mapping (migration 012) and
surface it through the benchmark-siblings API. The agentic detail page's request
timeline now deep-links each request bar to its exact conversation in the
/datasets viewer — the request cid, stripped of any ::sa:/::fa: suffix, is the
dataset conv_id. Tooltip shows a 'click to view in dataset' hint; bars get a
pointer cursor only when a mapping exists. Backfilled workflow_run 27915787191
(the dsv4/b300/vllm run incl. point 422083) → cc-traces-weka-062126.

Verified: clicking a timeline bar on /inference/agentic/422083 navigates to the
matching /datasets/cc-traces-weka-062126/conversations/<conv_id>.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The timeline link now carries ?turn=<ti> (and &sa=<agentId> for subagent
requests). The flamegraph resolves the target node — main turns by ordinal,
subagent turns by matching the group's agentId then the ti-th child — expands
the subagent group if needed, scrolls the row into view, and flashes a ring.

subagentIdOf strips the harness stream suffix (:s<n> and :aux:<n>) so the cid's
agent id matches the dataset SubagentNode.agentId. Verified end-to-end: clicking
a subagent bar on /inference/agentic/422083 opens the conversation, expands the
right group, and highlights the exact subturn.
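
The cid handling and link construction might be sketched as below. The assumed cid shape is `<conv_id>::sa:<agentId>[:s<n> | :aux:<n>]`; that shape, the `convIdOf` and `buildConversationLink` names, and the sample ids are assumptions. `subagentIdOf` borrows its name from the commit, but the body is illustrative.

```typescript
// Dataset conv_id: everything before the first "::sa:" or "::fa:" marker.
function convIdOf(cid: string): string {
  const m = cid.match(/^(.*?)::(?:sa|fa):/);
  return m ? m[1] : cid;
}

// Subagent id with the harness stream suffix (":s<n>" or ":aux:<n>") removed,
// or null for a main-conversation request.
function subagentIdOf(cid: string): string | null {
  const marker = "::sa:";
  const i = cid.indexOf(marker);
  if (i < 0) return null;
  return cid.slice(i + marker.length).replace(/:(?:s\d+|aux:\d+)$/, "");
}

// Deep link into the dataset viewer: ?turn=<ti>, plus &sa=<agentId> for subagent requests.
function buildConversationLink(slug: string, cid: string, turnIndex: number): string {
  const base = `/datasets/${encodeURIComponent(slug)}/conversations/${encodeURIComponent(convIdOf(cid))}`;
  const agentId = subagentIdOf(cid);
  const sa = agentId === null ? "" : `&sa=${encodeURIComponent(agentId)}`;
  return `${base}?turn=${turnIndex}${sa}`;
}

console.log(buildConversationLink("cc-traces-weka-062126", "conv42::sa:agent7:s3", 2));
```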

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ooltip

- Deep-link highlight is now state-driven (bg-primary/20 + ring, fades over
  700ms) instead of fragile classList mutation, so it's clearly visible and
  survives re-renders. Subagent groups still auto-expand and scroll into view.
- Portal the hover tooltip to document.body so its position:fixed is
  viewport-relative — an ancestor transform was offsetting it away from the
  cursor. Now it sits at pointer+12px.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The conversation page read ?turn/&sa from window.location.search in a useState
initializer, which captures stale/empty params during a client-side navigation —
so scroll+highlight+expand only worked after a manual reload. Switch to the
reactive useSearchParams (page wrapped in Suspense) so the params are present on
the first nav. Also make the flamegraph expand the target subagent group via an
effect (reacting to target changes), and defer the scroll one frame so the
just-expanded child row exists. Verified via a real timeline click — no reload.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
In HC mode the iwanthue palette is sized and indexed by the key set it's
generated over. ScatterGraph generated it from the *active* (selected) hw set,
so deselecting a line shrank the set, re-sized the palette, and shifted every
remaining line's hue — most visible on single-vendor agentic runs (which span
the full hue wheel since 2c06009), where deselecting B300 could recolor B200
from red to blue.

Pass the stable full set of hw-types-with-data as hcKeys so the palette and
per-key index are fixed; toggling now only hides/shows lines without recoloring
the rest. Adds a useThemeColors regression test asserting a line's HC color is
identical across active subsets when hcKeys is the full set.
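
The recoloring effect can be reproduced with a toy palette that, like the one described, is sized and indexed by the key set it is generated over. Evenly spaced hues stand in for the real palette here; `hcHue` and the key names are hypothetical.

```typescript
// Toy palette: hue depends on the key's index within the sorted key set
// and on the size of that set.
function hcHue(key: string, hcKeys: string[]): number | null {
  const ordered = hcKeys.slice().sort();
  const idx = ordered.indexOf(key);
  return idx < 0 ? null : (idx * 360) / ordered.length;
}

const allHwKeys = ["b200", "b300", "h200"];

// Palette built from the active set: deselecting b200 changes b300's index and
// the palette size, so b300's hue shifts.
console.log(hcHue("b300", ["b200", "b300"]), hcHue("b300", ["b300"]));

// Palette built from the stable full set: b300's hue is the same no matter
// which lines are currently selected.
console.log(hcHue("b300", allHwKeys));
```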

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>