[WIP] agentx#348
Draft
cquil11 wants to merge 92 commits into
Adds agentic_traces scenario end-to-end:
- Schema migrations for agentic scenario, availability, and KV offload mode
- DB ingest/ETL + query updates to carry scenario, offload_mode, and server/theoretical cache-hit rates through to the API layer
- Frontend types, filters (GlobalFilterContext / InferenceContext / ChartControls), URL state, and tooltip rows for agentic-only fields
- ScatterGraph: subtle dashed halo on Pareto-frontier points that used KV offload so the tradeoff is visible at a glance
- ScatterGraph: include `offload_mode` in `buildPointConfigId` so d3's data join keeps both `on` and `off` variants for the same (config, conc). Without it, the second variant collapsed onto the first key, so FP8 offload-on points (and their halos) silently disappeared.
- benchmark-mapper: handle older artifacts that emit `users`/`offload_mode` AND newer ones that emit `conc`/`offloading` (with 'none' → 'off' mapping).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
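A minimal sketch of the two fixes above. The `Point` and `RawRow` shapes and the `normalizeRow` helper are hypothetical; only `buildPointConfigId`, the field names, and the 'none' → 'off' mapping come from the commit text. Treating any other offload value as `on` is an assumption.

```typescript
// Hypothetical point shape; the real ScatterGraph point type is not shown here.
interface Point {
  config: string;
  conc: number;
  offload_mode?: "on" | "off";
}

// Include offload_mode in the join key so the "on" and "off" variants of the
// same (config, conc) pair get distinct keys in the data join.
function buildPointConfigId(p: Point): string {
  return [p.config, String(p.conc), p.offload_mode ?? "off"].join("|");
}

// Hypothetical raw row covering both the older and the newer field names.
interface RawRow {
  users?: number;
  conc?: number;
  offload_mode?: string;
  offloading?: string;
}

// Normalize either artifact generation to { conc, offload_mode }.
function normalizeRow(r: RawRow): { conc: number | undefined; offload_mode: "on" | "off" } {
  const raw = r.offload_mode ?? r.offloading ?? "off";
  // Newer artifacts emit 'none' for no offload; map it to 'off'.
  // Assumption: every other value means offload was enabled.
  const offload_mode = raw === "none" || raw === "off" ? "off" : "on";
  return { conc: r.conc ?? r.users, offload_mode };
}
```

With the mode in the key, two points that differ only in offload mode no longer collapse onto one key.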
The halo's purpose is to surface KV-offload usage; restricting it to Pareto-frontier-only points hid the indicator on most runs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
b300-p1 (and similar) artifacts were skipping ingest because the runner-pool suffix wasn't in the strip list and didn't normalize to the canonical b300 GPU key. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
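As a rough illustration of the normalization, the sketch below strips a trailing runner-pool tag before looking up the GPU key. The `-p<digits>` pattern and the `normalizeGpuKey` name are assumptions drawn from the `b300-p1` example; the real code uses a strip list whose contents are not shown here.

```typescript
// Assumed suffix pattern: a trailing "-p<digits>" runner-pool tag such as "b300-p1".
const RUNNER_POOL_SUFFIX = /-p\d+$/;

// Lower-case the runner name and drop the pool suffix to get the canonical GPU key.
function normalizeGpuKey(runnerName: string): string {
  return runnerName.trim().toLowerCase().replace(RUNNER_POOL_SUFFIX, "");
}
```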
- Label text now includes `C=<conc>` alongside the GPU/parallelism tag (default `<tp> C=<conc>`, advanced `<getPointLabel> C=<conc>`)
- Bumped point-label font-weight to 700 so the labels read clearly against the chart fill
- Greedy collision-avoidance pass on render and zoom: tries placing each label above/below the point through 4 candidate dy offsets, hiding the label only when no slot is free
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…oint Tspans now ride above the text's `dy` anchor — the LAST line sits at the anchor (just above the point) and earlier lines stack above it. Previously the second tspan landed below the anchor and crashed into the marker. Also widened collision candidates by label height so the flipped-below position fully clears the point on multi-line labels. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… pass
When a `<text>` contains tspans, the parent's `dy` does not shift the bbox cleanly: its (unused) y=0 origin still factors in, so the rendered text ended up centered on the point. Move the absolute offset into the FIRST tspan's `dy`; later tspans cascade by 1.1em. Collision avoidance now drives the first tspan's `dy` and tries four candidate baselines (primary above, primary below, secondary above, secondary below), accounting for full label height when picking a non-overlapping slot. Labels are still hidden as a last resort.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
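The greedy pass can be sketched on plain rectangles, independent of d3 or the DOM. Everything here (`Box`, `LabelInput`, `placeLabels`, the 4px gap) is hypothetical; only the four-candidate order and the hide-as-last-resort rule follow the description above.

```typescript
interface Box { x: number; y: number; w: number; h: number }
interface LabelInput { cx: number; cy: number; w: number; h: number }

// Axis-aligned overlap test; boxes that only touch on an edge do not overlap.
function overlaps(a: Box, b: Box): boolean {
  return a.x < b.x + b.w && b.x < a.x + a.w && a.y < b.y + b.h && b.y < a.y + a.h;
}

// Returns the chosen top edge (y) per label, or null when the label is hidden.
function placeLabels(labels: LabelInput[], gap: number = 4): (number | null)[] {
  const placed: Box[] = [];
  return labels.map((l) => {
    // Candidate slots: primary above, primary below, secondary above, secondary below.
    // The secondary slots are offset by the full label height.
    const candidates = [
      l.cy - gap - l.h,
      l.cy + gap,
      l.cy - gap - 2 * l.h,
      l.cy + gap + l.h,
    ];
    for (const y of candidates) {
      const box: Box = { x: l.cx - l.w / 2, y, w: l.w, h: l.h };
      if (!placed.some((p) => overlaps(p, box))) {
        placed.push(box); // greedy: the first free slot wins
        return y;
      }
    }
    return null; // no free slot: hide the label
  });
}
```

Because the pass is greedy, earlier labels always keep their primary slot; ordering the input by priority decides which labels get displaced first.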
Two complementary fixes for runs whose `results_bmk` aggregated artifact ends up containing both a successful row and a failed-attempt row for the same (config, conc, offload); the failed row's null metrics were overwriting the good row via ON CONFLICT DO UPDATE.
1. Artifact-level: strip the trailing `_<runner-pool>_<attempt>` suffix from each artifact name and group by the logical name, keeping only the most recent per group.
2. Row-level: skip rows with `num_requests_successful === 0` AND `num_requests_total > 0`. The aggregated artifact merges rows from all runners, including failed ones, so artifact-level dedup alone can't reach inside it.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
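The two levels of dedup might look roughly like this. The `Artifact` shape, the `createdAt` field, and the assumption that `<attempt>` is numeric are hypothetical; the row predicate is taken directly from the commit text.

```typescript
interface Artifact { name: string; createdAt: number }
interface Row { num_requests_successful: number; num_requests_total: number }

// Assumed naming: "<logical>_<runner-pool>_<attempt>" with a numeric attempt.
function logicalName(name: string): string {
  return name.replace(/_[^_]+_\d+$/, "");
}

// Artifact-level: keep only the most recent artifact per logical name.
function latestPerLogicalName(artifacts: Artifact[]): Artifact[] {
  const latest = new Map<string, Artifact>();
  for (const a of artifacts) {
    const key = logicalName(a.name);
    const cur = latest.get(key);
    if (!cur || a.createdAt > cur.createdAt) latest.set(key, a);
  }
  return Array.from(latest.values());
}

// Row-level: a row that attempted requests but completed none is a failed attempt.
function isFailedAttempt(r: Row): boolean {
  return r.num_requests_successful === 0 && r.num_requests_total > 0;
}
```

Filtering with `rows.filter((r) => !isFailedAttempt(r))` before the upsert keeps a failed row's null metrics from overwriting a good row.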
# Conflicts:
# packages/app/src/components/GlobalFilterContext.tsx
# packages/app/src/components/inference/utils/tooltipUtils.ts
# packages/db/src/etl/normalizers.ts
Tag display name for the `aiperf` spec_method suffix used by the alternate-harness runs ingested for the agentic minimax sweep. Without this entry the legend shows 'AIPERF' from the default toUpperCase fallback. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
bigint workflow_run_id sometimes deserializes as a number on the frontend depending on the postgres adapter's behavior; strict === between a number and a string silently dropped every match, so the changelog popover always reported "no changelog data available." Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
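One way to make the comparison robust is to normalize both sides to strings first. The `sameRunId` helper is a hypothetical name, and the sketch assumes any numeric id is within the safe-integer range, so its string form is exact.

```typescript
type RunId = string | number;

// Compare ids by their string form so a number and a string with the same
// digits match. Assumes numeric ids are safe integers.
function sameRunId(a: RunId, b: RunId): boolean {
  return String(a) === String(b);
}
```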
If the selected model has agentic_traces data, prefer that over the default 8K/1K fixed-seq when the user hasn't explicitly chosen via URL. effectiveSequence already falls back to availableSequences[0] for models without agentic, so models with only fixed-seq data still render correctly. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
# Conflicts:
# packages/app/src/components/inference/ui/ChartControls.tsx
# packages/app/src/components/inference/utils/tooltipUtils.ts
# packages/db/src/etl/normalizers.ts
rowToAggDataEntry was only copying median/p99 metric variants — picking p90/p99.9 in the percentile selector silently fell back to 0 and collapsed every point into a vertical line at x=0. Copy the full median/p90/p99/p99.9 set into AggDataEntry. Hide the X-Axis Metric dropdown for agentic mode (it doubled up with the percentile selector) and route the input-metric chart through withPercentile so picking p99 actually plots p99_ttft instead of the hard-coded p99_ttft config default. Percentile options pared back to median + p99.
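A sketch of what a `withPercentile`-style helper could look like. The `<percentile>_<metric>` field naming (including a `p99.9_` prefix) is an assumption inferred from names like `p99_ttft`; the real signature is not shown in this PR text.

```typescript
type Percentile = "median" | "p90" | "p99" | "p99.9";

// Swap the percentile prefix of a metric field, e.g. "p99_ttft" -> "p90_ttft".
// Fields without a recognized prefix are returned unchanged.
function withPercentile(field: string, pct: Percentile): string {
  const m = field.match(/^(?:median|p99\.9|p99|p90)_(.+)$/);
  return m ? `${pct}_${m[1]}` : field;
}
```

Applying it to a field that is already at the requested percentile returns the same field, so a trailing pass is safe to run unconditionally.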
# Conflicts:
# packages/app/src/components/GlobalFilterContext.tsx
# packages/app/src/components/inference/InferenceContext.tsx
# packages/app/src/components/inference/hooks/useChartData.ts
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Aligns the TTFT x-axis selectors with the percentile selector — only p90 is offered everywhere. Default x-axis metric and chart config input-throughput x are p90_ttft. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The `!isAgentic` gate on the e2e TTFT override branch dropped the user's `p90_ttft` pick in agentic mode, leaving the chart on the default p90_e2el. The trailing withPercentile pass is idempotent when xAxisField is already at the right percentile, so the gate is unnecessary. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…indow
Two fixes to the conversation/request-timeline view:
1. The Profiling vs 'All (incl. warmup)' toggle never did anything: aiperf's profile_export only contains profiling-phase requests, so every stored record has phase='profiling' (verified: 297k/297k rows). Hide the toggle unless a non-profiling request actually exists, so it reappears and works only if warmup is ever exported.
2. The timeline grew to fit every conversation/worker, making the card arbitrarily tall. Cap the body at a fixed height (480px) and scroll the rows vertically inside it. Few-row runs still size to content (no empty space); the label column and bars scroll together since they share the one scroll container.
Verified live on a 3475-request point: phase toggle absent, row-mode toggle still present, window clientHeight 480 with ~3745px scrolling inside.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…t zoom The fixed-height window put the chart's horizontal scrollbar at the bottom of the tall (full-height) content, below the fold and unreachable. Make the window itself the single scroll container (overflow-auto, both axes) and pin the label column with position:sticky left-0, so the horizontal scrollbar stays at the window's bottom edge while the label column stays put during horizontal scroll and scrolls with the rows vertically. Also add double-click anywhere on the timeline to reset zoom/pan (same resetZoom the existing button calls) and note it in the hint text. Verified live: window scrollW 1280 > clientW 879 (h-scroll present and working), label column sticky, rows scroll vertically. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The dashed offload-mode ring (drawn in ScatterGraph's onRender for every point with offload_mode='on') was missing from GPU compare mode (GPUGraph), so the CPU-offloading indicator never appeared there. Mirror it in GPUGraph's onRender — same dashed var(--foreground) ring at POINT_SIZE+4, appended inside each .dot-group so it travels with the point on zoom/pan. Verified live in compare mode (DSv4 B200/B300 agentic): offload points now render the dashed halo (5 rings, r=7.5, dash 3 2). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
generateHighContrastColors clamps each vendor's series into its brand hue zone (NVIDIA=green, AMD=red) at <=PREFERRED_MAX items. The point of that clamp is to keep DIFFERENT vendors apart at a glance — but when only one vendor is present (the common all-NVIDIA agentic comparison: B200/B300 x vLLM/SGLang), there's no rival to separate from, so every series collapses into the same narrow green band and high-contrast mode looks like it does nothing. When a single vendor is present, skip the brand zone and rival-ban and use the full hue wheel for maximum separation. Verified on an all-NVIDIA agentic view: HC now spreads pink/blue/gold/green (hues 45/99/227/330, min adjacent gap 54deg) instead of four near-identical greens. Multi-vendor behavior is unchanged — vendors keep their brand zones so they stay distinguishable. The non-HC palette still carries vendor identity. Updated the single-vendor color tests to assert separability across the full wheel rather than brand-zone confinement. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
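The branch can be illustrated with a much simpler hue assignment than the real generator. Everything below is a hypothetical stand-in: the zone bounds are invented for the example, and even spacing replaces whatever separation logic `generateHighContrastColors` actually uses.

```typescript
type Vendor = "nvidia" | "amd";
interface Series { key: string; vendor: Vendor }

// Hypothetical brand hue zones in degrees (the AMD zone wraps past 360).
const BRAND_ZONE: Record<Vendor, [number, number]> = {
  nvidia: [90, 150],
  amd: [340, 380],
};

function assignHues(series: Series[]): Map<string, number> {
  const hues = new Map<string, number>();
  const vendors = new Set(series.map((s) => s.vendor));
  if (vendors.size <= 1) {
    // Single vendor: no rival to separate from, so use the full hue wheel.
    series.forEach((s, i) => hues.set(s.key, (i * 360) / series.length));
    return hues;
  }
  // Multiple vendors: keep each vendor's series inside its brand zone.
  vendors.forEach((vendor) => {
    const group = series.filter((s) => s.vendor === vendor);
    const [lo, hi] = BRAND_ZONE[vendor];
    group.forEach((s, i) =>
      hues.set(s.key, (lo + ((i + 0.5) * (hi - lo)) / group.length) % 360),
    );
  });
  return hues;
}
```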
Conflict resolutions:
- ScatterGraph: adopt master's interactionRef refactor (visibility/color/shape read via interactionRef.current so toggles restyle without a rebuild); keep agentx's Sequence import, pointMatchesIssue from master, per-point 'C=<conc>' labels, and the traceAvailability tooltip dep (selectedPrecisions now flows through interactionRef, not the deps array).
- GPUGraph: master's !showPointLabels (state renamed from hidePointLabels) with agentx's 'C=<conc>' label text.
- ChartDisplay: keep agentx's view-mode toggle + MP4/replay export + x-axis-mode e2e heading (session-time / prefill-tps); the resolved onExportCsv + caption already carry master's MetricAssumptionNotes, UnofficialDomainNotice, richer precision/sequence subtitle, and knownIssueCsvNote CSV notes. Dropped master's now-superseded E2eXAxisDropdown variant.
- chart-legend: keep both LegendSwitchConfig additions (infoTooltip + advanced).
- GlobalFilterContext: master's new precisionCurveCounts used the removed islOslToSequence; switch to agentx's rowToSequence (handles agentic rows' null isl/osl), matching the other call sites.
- db/package.json: master's @types/node 25.9.3 + coverage-v8 4.1.9, keep agentx's @types/stream-json; pnpm-lock regenerated.
- decoration.test: stub useTraceAvailability so master's decoration test doesn't need a QueryClientProvider for agentx's tooltip query.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ontrast on
Change the inference chart's default toggle states:
- Line Labels: on -> off (i_linelabel=1 overrides on)
- Parallelism Labels: off -> on, which also defaults point labels on since parallelism labels ARE point labels (i_advlabel=0 overrides off)
- High Contrast: off -> on, via a new opt-in defaultHighContrast on useChartUIState so reliability/evaluation (r_/e_ prefixes) stay off; i_hc=0 overrides off. Historical trends shares the inference context so it inherits the high-contrast default too.
URL serialization flipped to omit each param at its new default and only write the override value, so share links stay clean. Updated line-labels, gradient-labels, and url-params E2E specs to the new defaults.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
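The omit-at-default serialization can be sketched as follows. The param names and defaults come from the commit text; the `Toggles` shape and both helper names are hypothetical, and the query string is parsed by hand to keep the example dependency-free.

```typescript
interface Toggles { i_linelabel: boolean; i_advlabel: boolean; i_hc: boolean }

// New defaults: line labels off, parallelism labels on, high contrast on.
const DEFAULTS: Toggles = { i_linelabel: false, i_advlabel: true, i_hc: true };

// Write a param only when it differs from its default, so share links stay clean.
function serializeToggles(state: Toggles): string {
  const parts: string[] = [];
  for (const key of Object.keys(DEFAULTS) as (keyof Toggles)[]) {
    if (state[key] !== DEFAULTS[key]) parts.push(`${key}=${state[key] ? 1 : 0}`);
  }
  return parts.join("&");
}

// Start from the defaults and apply whatever overrides the URL carries.
function parseToggles(query: string): Toggles {
  const out: Toggles = { ...DEFAULTS };
  for (const pair of query.replace(/^\?/, "").split("&")) {
    const [k, v] = pair.split("=");
    if (k === "i_linelabel" || k === "i_advlabel" || k === "i_hc") out[k] = v === "1";
  }
  return out;
}
```

A link with no toggle params therefore always reproduces the current defaults, and a link that carries `i_hc=0` keeps high contrast off even if the default changes again later.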
…bling chips
The agentic detail page's sibling navigator labeled configs with an ad-hoc
`TP{n}EP{n}` / `{p}P+{d}D` scheme that ignored dp-attention and the
TEP/DEP collapse, so a DEP4 config read as plain TP4EP4 (and, mid-deploy
before the API carried dp_attention, as TEP4).
Extract the scatter chart's labeler into a shared parallelism-label module
(configSegmentLabel + parallelismLabel) and route both getPointLabel and the
sibling chipLabel through it, so the two surfaces describe a config
identically (TP/EP/TEP/DEP/DPA…, multinode-disagg worker segments).
Carry the fields the labeler needs through the siblings query/API/hook:
decode/prefill dp_attention + num_workers + is_multinode.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
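A heavily simplified sketch of a shared segment labeler. Only the TEP4/DEP4 behavior is grounded in the example above (tp=4, ep=4 reads as TEP4 without dp-attention and DEP4 with it); the remaining rules, the `SegmentConfig` shape, and the ` DPA` suffix are assumptions, and multinode-disagg worker segments are left out.

```typescript
interface SegmentConfig { tp: number; ep: number; dp_attention: boolean }

function configSegmentLabel(c: SegmentConfig): string {
  // Collapse matching tp/ep into TEP<n>, or DEP<n> when dp-attention is on.
  if (c.ep > 1 && c.tp === c.ep) {
    return `${c.dp_attention ? "DEP" : "TEP"}${c.tp}`;
  }
  // Assumed fallback for everything else.
  const base = c.ep > 1 ? `TP${c.tp}EP${c.ep}` : `TP${c.tp}`;
  return c.dp_attention ? `${base} DPA` : base;
}
```

Routing both the chart label and the sibling chip through one function like this is what keeps the two surfaces in agreement.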
Add a 'Sort by' dropdown to the agentic detail page's point navigator:
- Default (DB order)
- Concurrency ↑
- Parallelism (groups all TP, then TEP/DEP/EP… by ep→tp→dpa, conc within)
- Throughput/GPU ↓
- Total requests ↓
Carry tput_per_gpu and total_requests (total_requests_completed, falling back to legacy num_requests_total) through the siblings query/API/hook. prev/next follow the sorted order, and the chosen sort is persisted in the URL (?sort=), read on mount and threaded through every point link plus a router.replace, so navigating to another point no longer resets it.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
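The five sort modes map naturally onto comparators. The `Sibling` shape and the mode names below are hypothetical; the parallelism order (ep, then tp, then dp-attention, then concurrency) follows the description above, and null metrics are assumed to sort last.

```typescript
type SortMode = "default" | "conc" | "parallelism" | "tput" | "requests";

interface Sibling {
  id: number;
  conc: number;
  tp: number;
  ep: number;
  dpa: boolean;
  tput_per_gpu: number | null;
  total_requests: number | null;
}

// Descending comparator; nulls sort last (values are assumed non-negative).
const desc = (a: number | null, b: number | null): number => (b ?? -1) - (a ?? -1);

function sortSiblings(rows: Sibling[], mode: SortMode): Sibling[] {
  const out = rows.slice(); // never mutate the DB-ordered input
  switch (mode) {
    case "conc":
      return out.sort((a, b) => a.conc - b.conc);
    case "parallelism":
      // ep first puts pure-TP configs (ep = 1) ahead of the EP family.
      return out.sort(
        (a, b) => a.ep - b.ep || a.tp - b.tp || Number(a.dpa) - Number(b.dpa) || a.conc - b.conc,
      );
    case "tput":
      return out.sort((a, b) => desc(a.tput_per_gpu, b.tput_per_gpu));
    case "requests":
      return out.sort((a, b) => desc(a.total_requests, b.total_requests));
    default:
      return out;
  }
}
```

Prev/next navigation then only needs the index of the current point in the sorted array.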
Additive migration backing the new /datasets area: a registry of ingested HF cc-traces-weka dataset versions (summary + precomputed chart_data) and one row per conversation holding a flamegraph-ready structure JSONB. Drop snippet in the migration header for revert. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Pure transforms (no DB) turning a raw cc-traces-weka conversation into a flamegraph-ready structure: ordered turn/subagent nodes with input split into cached-prefix vs uncached-suffix. Ports _count_seen_prefix_blocks from the aiperf weka loader; subagents run against a spawn-time snapshot of the parent prefix cache. Includes linear/log histogram helpers for the detail cards and 13 unit tests. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
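The cached-prefix split can be sketched in terms of block hash ids. This is an interpretation of what a `_count_seen_prefix_blocks` port would do, based only on its name and the description above, not a copy of the aiperf loader; `Turn`, `splitTurns`, and the block-level (rather than token-level) counts are assumptions.

```typescript
// Count how many leading blocks of a request are already in the prefix cache.
function countSeenPrefixBlocks(hashIds: number[], seen: Set<number>): number {
  let n = 0;
  for (const h of hashIds) {
    if (!seen.has(h)) break; // the cached prefix ends at the first unseen block
    n++;
  }
  return n;
}

interface Turn { hashIds: number[] }
interface Split { cachedBlocks: number; uncachedBlocks: number }

// Walk turns in order, splitting each input into cached prefix vs uncached suffix.
// seenAtStart is copied, so a subagent can start from a spawn-time snapshot of
// the parent's cache without mutating it.
function splitTurns(turns: Turn[], seenAtStart: Set<number> = new Set<number>()): Split[] {
  const seen = new Set<number>(seenAtStart);
  return turns.map((t) => {
    const cachedBlocks = countSeenPrefixBlocks(t.hashIds, seen);
    t.hashIds.forEach((h) => seen.add(h));
    return { cachedBlocks, uncachedBlocks: t.hashIds.length - cachedBlocks };
  });
}
```

For a subagent, passing the parent's `seen` set as `seenAtStart` at spawn time gives the snapshot behavior, since later parent turns do not affect the copy.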
Pages the HF datasets-server rows API (adaptive page length for the ~3.5MB rows), builds the flamegraph structure + cached-prefix split per conversation, accumulates dataset-level distributions (input/output length, turns/conv, subagent fan-out, cached fraction) into datasets.chart_data, and upserts datasets + dataset_conversations. DATABASE_WRITE_URL must be provided. Verified the cached split against a hand computation on raw hash_ids. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Retry 429/5xx with exponential backoff (honoring Retry-After) instead of shrinking page size, plus a 400ms inter-page delay. Lets the full 393-row ingest complete without tripping the datasets-server rate limit. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
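A minimal sketch of the retry policy, with the HTTP call and the sleep injected so the logic stays testable. The base delay, cap, attempt limit, and all names are assumptions; only the 429/5xx rule, exponential backoff, and honoring Retry-After come from the commit text. The sketch handles Retry-After only in its delta-seconds form, not the HTTP-date form.

```typescript
// Minimal response surface, so the sketch does not depend on DOM or Node types.
interface HttpResponse {
  status: number;
  headers: { get(name: string): string | null };
}

function isRetryable(status: number): boolean {
  return status === 429 || (status >= 500 && status < 600);
}

// Prefer the server's Retry-After (seconds); otherwise back off exponentially.
function retryDelayMs(attempt: number, retryAfter: string | null, baseMs: number = 500, maxMs: number = 30000): number {
  if (retryAfter !== null && retryAfter.trim() !== "") {
    const secs = Number(retryAfter);
    if (isFinite(secs) && secs >= 0) return secs * 1000;
  }
  return Math.min(maxMs, baseMs * Math.pow(2, attempt));
}

async function fetchWithRetry(
  doFetch: () => Promise<HttpResponse>,
  sleep: (ms: number) => Promise<void>,
  maxAttempts: number = 5,
): Promise<HttpResponse> {
  for (let attempt = 0; ; attempt++) {
    const res = await doFetch();
    // Return on success, on a non-retryable status, or when attempts run out.
    if (!isRetryable(res.status) || attempt + 1 >= maxAttempts) return res;
    await sleep(retryDelayMs(attempt, res.headers.get("retry-after")));
  }
}
```

The fixed inter-page delay mentioned above would sit in the paging loop around this call, not inside the retry helper.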
queries/datasets.ts: listDatasets, getDataset (incl chart_data), listConversations (paginated, searchable, 4 sort modes — separate per-sort queries since the neon HTTP driver can't compose order-by fragments), getConversation (flamegraph structure). Routes under /api/v1/datasets/* with cachedQuery + gzip cachedJson. Hooks use-datasets.ts mirror the existing benchmark-siblings hook style. Verified all four routes against the live branch. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- /datasets: methodology prose + dataset registry cards (DatasetList)
- /datasets/[slug]: summary stats, model mix, 5 precomputed-histogram distribution cards (DistributionCard, log/linear), and a searchable/sortable/paginated conversation table
- /datasets/[slug]/conversations/[convId]: per-conversation TraceFlamegraph: one bar per turn (cached prefix + uncached input + output), subagent groups collapsible (collapsed by default) with expand/collapse-all
- header nav 'Datasets' link
- query-layer test (mock DbClient): not-found paths + numeric coercion
Verified end-to-end against the live branch DB: both datasets list with real stats, distributions render, flamegraph shows the prefix-reuse signature (turn 2 fully uncached, later turns mostly cached), expand-all surfaces subagent subturns. Zero console errors.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Wrap rows in a fixed-height (max-h-[520px]) vertically scrollable bordered box. Subagent group headers carry aggregate token totals that dwarf any single turn, which made their bars overflow the row (width >> 100%). Now turns/subturns use a per-turn scale while group headers use a separate group-aggregate scale (slim muted strips), both clamped to the track — groups stay comparable to each other and nothing overflows. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
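The two-scale clamp reduces to a small helper. The names and the `Scales` shape are hypothetical; the point is that group headers are measured against a group-aggregate maximum while turns use a per-turn maximum, and both are clamped to the track.

```typescript
// Width of a bar as a percentage of its track, clamped to [0, 100].
function barWidthPct(tokens: number, scaleMax: number): number {
  if (scaleMax <= 0) return 0;
  return Math.min(100, Math.max(0, (tokens / scaleMax) * 100));
}

interface Scales { turnMax: number; groupMax: number }

// Group headers use the group-aggregate scale; turns and subturns use the per-turn scale.
function widthFor(tokens: number, isGroupHeader: boolean, scales: Scales): number {
  return barWidthPct(tokens, isGroupHeader ? scales.groupMax : scales.turnMax);
}
```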
Add run_datasets (workflow_run → dataset slug) mapping (migration 012) and surface it through the benchmark-siblings sku. The agentic detail page's request timeline now deep-links each request bar to its exact conversation in the /datasets viewer — the request cid, stripped of any ::sa:/::fa: suffix, is the dataset conv_id. Tooltip shows a 'click to view in dataset' hint; bars get a pointer cursor only when a mapping exists. Backfilled workflow_run 27915787191 (the dsv4/b300/vllm run incl. point 422083) → cc-traces-weka-062126. Verified: clicking a timeline bar on /inference/agentic/422083 navigates to the matching /datasets/cc-traces-weka-062126/conversations/<conv_id>. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The timeline link now carries ?turn=<ti> (and &sa=<agentId> for subagent requests). The flamegraph resolves the target node — main turns by ordinal, subagent turns by matching the group's agentId then the ti-th child — expands the subagent group if needed, scrolls the row into view, and flashes a ring. subagentIdOf strips the harness stream suffix (:s<n> and :aux:<n>) so the cid's agent id matches the dataset SubagentNode.agentId. Verified end-to-end: clicking a subagent bar on /inference/agentic/422083 opens the conversation, expands the right group, and highlights the exact subturn. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
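A sketch of how the deep link might be derived from a request cid. The cid layout (`<conv_id>::sa:<agentId>:s<n>`) is an assumption inferred from the suffixes named above; `convIdOf` and `datasetLink` are hypothetical helpers, and `subagentIdOf` here is a reimplementation, not the PR's code.

```typescript
// The dataset conv_id is the cid with any ::sa:/::fa: suffix removed.
function convIdOf(cid: string): string {
  const i = cid.search(/::(?:sa|fa):/);
  return i === -1 ? cid : cid.slice(0, i);
}

// Agent id for subagent requests, minus the harness stream suffix (:s<n> or :aux:<n>).
function subagentIdOf(cid: string): string | null {
  const m = cid.match(/::sa:(.+)$/);
  if (!m) return null;
  return (m[1] ?? "").replace(/(?::s\d+|:aux:\d+)$/, "");
}

// Build the /datasets link with ?turn=<ti> and, for subagent requests, &sa=<agentId>.
function datasetLink(slug: string, cid: string, turnIndex: number): string {
  const base = `/datasets/${slug}/conversations/${encodeURIComponent(convIdOf(cid))}?turn=${turnIndex}`;
  const agentId = subagentIdOf(cid);
  return agentId === null ? base : `${base}&sa=${encodeURIComponent(agentId)}`;
}
```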
…ooltip - Deep-link highlight is now state-driven (bg-primary/20 + ring, fades over 700ms) instead of fragile classList mutation, so it's clearly visible and survives re-renders. Subagent groups still auto-expand and scroll into view. - Portal the hover tooltip to document.body so its position:fixed is viewport-relative — an ancestor transform was offsetting it away from the cursor. Now it sits at pointer+12px. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The conversation page read ?turn/&sa from window.location.search in a useState initializer, which captures stale/empty params during a client-side navigation — so scroll+highlight+expand only worked after a manual reload. Switch to the reactive useSearchParams (page wrapped in Suspense) so the params are present on the first nav. Also make the flamegraph expand the target subagent group via an effect (reacting to target changes), and defer the scroll one frame so the just-expanded child row exists. Verified via a real timeline click — no reload. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
In HC mode the iwanthue palette is sized and indexed by the key set it's generated over. ScatterGraph generated it from the *active* (selected) hw set, so deselecting a line shrank the set, re-sized the palette, and shifted every remaining line's hue — most visible on single-vendor agentic runs (which span the full hue wheel since 2c06009), where deselecting B300 could recolor B200 from red to blue. Pass the stable full set of hw-types-with-data as hcKeys so the palette and per-key index are fixed; toggling now only hides/shows lines without recoloring the rest. Adds a useThemeColors regression test asserting a line's HC color is identical across active subsets when hcKeys is the full set. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
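The fix amounts to sizing and indexing the palette by the full key set, never the active subset. The sketch below uses a hypothetical `hcColorsFor` helper and an injected `makePalette` function in place of the iwanthue generator.

```typescript
// Assign colors by each key's index in the stable full set (hcKeys), so that
// toggling lines changes which keys are returned, never which color a key gets.
function hcColorsFor(
  activeKeys: string[],
  hcKeys: string[],
  makePalette: (n: number) => string[],
): Map<string, string> {
  const all = Array.from(new Set(hcKeys)).sort();
  const palette = makePalette(all.length); // sized by the full set
  const out = new Map<string, string>();
  for (const key of activeKeys) {
    const i = all.indexOf(key);
    if (i !== -1) out.set(key, palette[i] ?? "");
  }
  return out;
}
```

A regression check in the same spirit as the one described above would assert that a key's color is identical for every active subset that contains it.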