Add: Insight Trace workspace generation for MindStudio profiling #821
vegetabledoww wants to merge 2 commits
Implement Insight Trace feature to generate MindStudio Insight-compatible trace data (trace.json + visualize_data.bin) for incore kernel instruction-level profiling. Supports simpler and ptoas dual backends.
Code Review
This pull request introduces a toolset for generating MindStudio Insight trace data for incore kernels, featuring both 'simpler' and 'ptoas' backends. Key components include a CLI, argument resolution recipes for paged attention, kernel classification logic, and workspace generation via C++ templates. Feedback identifies a bug in scalar argument bit-packing and a potential out-of-bounds access in the generated host code. Additionally, improvements were suggested for optimizing file I/O and external process calls, along with the removal of incomplete code in the PTOAS backend.

    elif item["kind"] == "scalar":
        result.append(
            TraceScalarArg(
                index=int(item["index"]),
                name=item["name"],
                dtype=item["dtype"],
                value=item["value"],
                pack_mode=item.get("pack_mode", "value"),
            )
        )

When loading scalar arguments from a JSON specification, float values intended to be passed as bit patterns (e.g., for FLOAT32 kernel arguments) are currently truncated to integers by the template renderer. If pack_mode is set to "bits", the float value should be converted to its IEEE 754 bit representation here to ensure the generated C++ code receives the correct data.
Suggested change:

    - elif item["kind"] == "scalar":
    -     result.append(
    -         TraceScalarArg(
    -             index=int(item["index"]),
    -             name=item["name"],
    -             dtype=item["dtype"],
    -             value=item["value"],
    -             pack_mode=item.get("pack_mode", "value"),
    -         )
    -     )
    + elif item["kind"] == "scalar":
    +     val = item["value"]
    +     pack_mode = item.get("pack_mode", "value")
    +     if pack_mode == "bits" and isinstance(val, float):
    +         val = _f32_bits(val)
    +     result.append(
    +         TraceScalarArg(
    +             index=int(item["index"]),
    +             name=item["name"],
    +             dtype=item["dtype"],
    +             value=val,
    +             pack_mode=pack_mode,
    +         )
    +     )


    def render_host(config: TraceConfig) -> str:
        tensors = [arg for arg in config.args if isinstance(arg, TraceTensorArg)]
        scalars = [arg for arg in config.args if isinstance(arg, TraceScalarArg)]

The generated C++ host code uses a fixed-size array std::array<int64_t, kArgsSlots> args where kArgsSlots is 50. There is currently no validation to ensure that argument indices provided in the configuration do not exceed this limit, which could lead to out-of-bounds access in the generated code.
Suggested change:

    - def render_host(config: TraceConfig) -> str:
    -     tensors = [arg for arg in config.args if isinstance(arg, TraceTensorArg)]
    -     scalars = [arg for arg in config.args if isinstance(arg, TraceScalarArg)]
    + def render_host(config: TraceConfig) -> str:
    +     for arg in config.args:
    +         if arg.index >= 50:
    +             raise ValueError(f"Argument index {arg.index} exceeds maximum supported slots (50)")
    +     tensors = [arg for arg in config.args if isinstance(arg, TraceTensorArg)]
    +     scalars = [arg for arg in config.args if isinstance(arg, TraceScalarArg)]


    from __future__ import annotations

    import re
    from pathlib import Path

    from .models import KernelShape, KernelSpec, SceneCaseContext

    _ARG_READ_RE = re.compile(r"args\[(\d+)\]")

The kernel source file is read from disk multiple times across different functions (classify_kernel, read_arg_indices, and validate_single_task_kernel). Following the general rule to avoid redundant I/O, consider using functools.lru_cache on a helper function to read the file content once and reuse it.
Suggested change:

    from __future__ import annotations

    import re
    from functools import lru_cache
    from pathlib import Path

    from .models import KernelShape, KernelSpec, SceneCaseContext

    _ARG_READ_RE = re.compile(r"args\[(\d+)\]")


    @lru_cache(maxsize=8)
    def _read_source(path: Path) -> str:
        return path.read_text()

References
- Avoid redundant I/O and parsing of large files by loading the data once and passing the parsed object to downstream functions.
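To illustrate the effect of the cached reader, here is a small self-contained sketch in which two consumers share one disk read. `_read_source` and `_ARG_READ_RE` follow the suggestion; the bodies of `read_arg_indices` and `count_arg_reads` are assumptions made for the demonstration, not the PR's implementation.

```python
from __future__ import annotations

import re
import tempfile
from functools import lru_cache
from pathlib import Path

_ARG_READ_RE = re.compile(r"args\[(\d+)\]")


@lru_cache(maxsize=8)
def _read_source(path: Path) -> str:
    # Path objects are hashable, so they can be used directly as cache keys.
    return path.read_text()


def read_arg_indices(path: Path) -> list[int]:
    """Assumed behaviour: the distinct args[N] indices read by the kernel."""
    return sorted({int(n) for n in _ARG_READ_RE.findall(_read_source(path))})


def count_arg_reads(path: Path) -> int:
    """Hypothetical second consumer of the same source text."""
    return len(_ARG_READ_RE.findall(_read_source(path)))


with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "kernel.cpp"
    src.write_text("auto q = args[0]; auto k = args[3]; auto s = args[3];")
    indices = read_arg_indices(src)   # first call reads the file
    count = count_arg_reads(src)      # second call is served from the cache
    info = _read_source.cache_info()
```

One caveat of this design: the cache is keyed on the path only, so a file that is modified during the same process keeps returning the stale text until `_read_source.cache_clear()` is called.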

    candidates = []
    for line in result.stdout.splitlines():
        parts = line.split()
        if len(parts) < 3 or parts[-2] not in {"T", "W"}:
            continue
        symbol = parts[-1]
        demangled = subprocess.run(["c++filt", symbol], check=False, capture_output=True, text=True).stdout.strip()
        if demangled.startswith(f"{kernel_base_name}("):
            candidates.append(symbol)

Calling c++filt in a loop for every symbol found in the library is inefficient as it spawns a new process for each iteration. It is better to batch all symbols and pass them to a single c++filt process via standard input. Additionally, before using zip on the symbols and demangled names, verify their lengths are identical to prevent silent truncation.
Suggested change:

    symbols = []
    for line in result.stdout.splitlines():
        parts = line.split()
        if len(parts) >= 3 and parts[-2] in {"T", "W"}:
            symbols.append(parts[-1])
    if not symbols:
        raise ValueError(f"No symbols found in {kernel_lib}")
    demangle_proc = subprocess.run(
        ["c++filt"], input="\n".join(symbols), capture_output=True, text=True, check=True
    )
    demangled_names = demangle_proc.stdout.splitlines()
    if len(symbols) != len(demangled_names):
        raise ValueError(f"Mismatched symbol count: {len(symbols)} symbols, {len(demangled_names)} demangled names")
    candidates = [
        sym for sym, demangled in zip(symbols, demangled_names)
        if demangled.startswith(f"{kernel_base_name}(")
    ]

References
- Before using zip on sequences that are expected to be of equal length, verify their lengths are identical and raise an error on mismatch to prevent silent truncation.

    main_cpp = case_dir / "main.cpp"
    if not main_cpp.is_file():
        return
    for line in main_cpp.read_text().splitlines():
        if ".bin" not in line or "fileSize_" not in line:
            continue

Tighten generated argument handling and PTOAS symbol/input processing so invalid replay specs fail early and float bit scalars preserve their intended encoding.
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Summary
Overview
This PR adds a built-in Insight Trace workflow to `simpler_setup`. It lets users select a simpler incore kernel and generate profiling artifacts directly consumable by MindStudio Insight.
New command:
Typical usage:
MindStudio Insight output
The feature exports the simulator trace package:
MindStudio Insight opens:
`trace.json` is the UI entry point, but the practical consumption unit is the whole `simulator/` directory, including `visualize_data.bin`, `core*.*/trace.json`, and `core*/*_instr_exe_*.csv`.
What changed
Added `simpler_setup/insight_trace/`, which supports:
- `SceneTestCase` modules and cases;
- `--kernel`, `--func-id`, or `--kernel-source`;
- `--arg-spec`;
- `msprof op simulator` collect/export;
Added CLI shim:
Generated simpler replay workspace:
The generated host runner allocates replay tensors, builds `Tensor` metadata, packs a 50-slot `args` array, launches `replay_entry`, and synchronizes.
Initial kernel support
Built-in recipes cover paged attention incore kernels:
For `CaseSmall1 + SF`:
Adds PTOAS backend plumbing for PTOAS-generated kernel C++ sources: call PTOAS `generate_testcase.py`, build the generated simulator runner, resolve the real exported kernel symbol with `nm -D` + `c++filt`, run `msprof op simulator`, and export the same MindStudio Insight `simulator/` artifact shape.
Validation hardening
The implementation rejects invalid replay arg specs: negative indices, indices outside the 50-slot args array, and duplicate indices. It also packs float scalar values as IEEE 754 bits when `pack_mode="bits"` or `dtype="FLOAT32_BITS"` is used.
Other hardening:
- `c++filt` call;
- `golden.py` is missing or fails.
Tests
Added tests:
Validated with:
Scope
Changed files:
Overall diff:
Output:
`outputs/insight_trace_*/insight_export/OPPROF_*/simulator/` (drag `trace.json` into MindStudio Insight).
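Before opening an export in MindStudio Insight, it can help to confirm that the package is complete. The sketch below assumes the output layout quoted above and assumes that `trace.json` and `visualize_data.bin` sit at the top level of `simulator/`. The function names are hypothetical and not part of the PR.

```python
from __future__ import annotations

import tempfile
from pathlib import Path

# Assumed layout, taken from the output path quoted in the PR description.
SIMULATOR_GLOB = "insight_trace_*/insight_export/OPPROF_*/simulator"
REQUIRED_FILES = ("trace.json", "visualize_data.bin")


def find_simulator_dirs(output_root: Path) -> list[Path]:
    """Return every exported simulator/ directory under the outputs root."""
    return sorted(p for p in output_root.glob(SIMULATOR_GLOB) if p.is_dir())


def missing_artifacts(simulator_dir: Path) -> list[str]:
    """Return the required top-level files that are absent from simulator_dir."""
    return [name for name in REQUIRED_FILES if not (simulator_dir / name).is_file()]


# Demonstration against a throwaway directory tree with one file left out.
with tempfile.TemporaryDirectory() as tmp:
    sim = Path(tmp) / "insight_trace_demo" / "insight_export" / "OPPROF_demo" / "simulator"
    sim.mkdir(parents=True)
    (sim / "trace.json").write_text("{}")
    found = find_simulator_dirs(Path(tmp))
    missing = missing_artifacts(found[0])
```

In real use, `find_simulator_dirs(Path("outputs"))` would be pointed at the actual outputs root, and an empty list from `missing_artifacts` would indicate that both top-level files are present.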