Add: Insight Trace workspace generation for MindStudio profiling #821

Open: vegetabledoww wants to merge 2 commits into hw-native-sys:main from vegetabledoww:2b

vegetabledoww commented May 19, 2026

Implement the Insight Trace feature, which generates MindStudio Insight-compatible trace data (trace.json + visualize_data.bin) for instruction-level profiling of incore kernels. Two backends are supported: simpler and ptoas.

Summary

Overview

This PR adds a built-in Insight Trace workflow to simpler_setup. It lets users select a simpler incore kernel and generate profiling artifacts directly consumable by MindStudio Insight.

New command:

python -m simpler_setup.tools.insight_trace

Typical usage:

python -m simpler_setup.tools.insight_trace \
  examples/a2a3/tensormap_and_ringbuffer/paged_attention/test_paged_attention.py \
  --case CaseSmall1 \
  --kernel SF

MindStudio Insight output

The feature exports the simulator trace package:

<workspace>/insight_export/OPPROF_*/simulator/

MindStudio Insight opens:

<workspace>/insight_export/OPPROF_*/simulator/trace.json

trace.json is the UI entry point, but the practical consumption unit is the whole simulator/ directory, including visualize_data.bin, core*.*/trace.json, and core*/*_instr_exe_*.csv.
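Because MindStudio Insight consumes the whole simulator/ directory, a quick completeness check before opening trace.json can save a failed import. The following is a minimal sketch, not the PR's validator: the function names are hypothetical, and it assumes trace.json and visualize_data.bin sit directly under simulator/, which the description implies but does not state outright.

```python
from __future__ import annotations

import tempfile
from pathlib import Path

# File names taken from the PR description; their exact location is an assumption.
REQUIRED_FILES = ("trace.json", "visualize_data.bin")


def find_simulator_dirs(workspace: Path) -> list[Path]:
    """Return every exported simulator/ directory under the workspace."""
    pattern = "insight_export/OPPROF_*/simulator"
    return sorted(p for p in workspace.glob(pattern) if p.is_dir())


def missing_artifacts(simulator_dir: Path) -> list[str]:
    """List the required top-level files that are absent from simulator_dir."""
    return [name for name in REQUIRED_FILES if not (simulator_dir / name).is_file()]


# Demo against a throwaway workspace; "OPPROF_demo" is a made-up directory name.
with tempfile.TemporaryDirectory() as tmp:
    simulator = Path(tmp) / "insight_export" / "OPPROF_demo" / "simulator"
    simulator.mkdir(parents=True)
    (simulator / "trace.json").write_text("{}")
    found = [p.name for p in find_simulator_dirs(Path(tmp))]
    before = missing_artifacts(simulator)
    (simulator / "visualize_data.bin").write_bytes(b"\x00")
    after = missing_artifacts(simulator)

print(found, before, after)
```

The sketch deliberately leaves the per-core files (core*.*/trace.json and the *_instr_exe_*.csv files) unchecked, since their exact layout is only given here as glob patterns.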

What changed

Added simpler_setup/insight_trace/, which supports:

  • loading SceneTestCase modules and cases;
  • selecting kernels by --kernel, --func-id, or --kernel-source;
  • classifying kernels as AIC-only, AIV-only, or SPMD mix;
  • resolving replay args from recipes or --arg-spec;
  • generating a standalone replay workspace;
  • running msprof op simulator collect/export;
  • validating exported MindStudio Insight artifacts.

Added CLI shim:

simpler_setup/tools/insight_trace.py

Generated simpler replay workspace:

replay_kernel.cpp
replay_launch.cpp
replay_host.cpp
CMakeLists.txt
run_collect.sh
insight_trace_config.json

The generated host runner allocates replay tensors, builds Tensor metadata, packs a 50-slot args array, launches replay_entry, and synchronizes.

Initial kernel support

Built-in recipes cover paged attention incore kernels:

QK, SF, PV, UP

For CaseSmall1 + SF:

args[0] sij   FLOAT32  [16, 16]
args[1] pij   BFLOAT16 [16, 16]
args[2] mij   FLOAT32  [16]
args[3] lij   FLOAT32  [16]
args[4] scale FLOAT32_BITS 1065353216
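The scale entry in args[4] is a raw bit pattern rather than a numeric value: 1065353216 is 0x3F800000, the IEEE 754 single-precision encoding of 1.0. A minimal sketch of the conversion using only the standard library follows; the helper names are illustrative and are not taken from the PR.

```python
import struct


def f32_bits(value: float) -> int:
    """Return the IEEE 754 single-precision bit pattern of value as an unsigned int."""
    return struct.unpack("<I", struct.pack("<f", value))[0]


def bits_f32(bits: int) -> float:
    """Inverse of f32_bits: reinterpret a 32-bit pattern as a float."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]


print(f32_bits(1.0))          # 1065353216
print(hex(f32_bits(1.0)))     # 0x3f800000
print(bits_f32(1065353216))   # 1.0
```

Passing the bit pattern through an integer slot keeps the float intact, whereas converting 1.0 to an integer value would store 1 and the kernel would reinterpret that as a denormal close to zero.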

PTOAS backend

Adds PTOAS backend plumbing for PTOAS-generated kernel C++ sources:

  • call PTOAS generate_testcase.py;
  • build the generated simulator runner;
  • resolve the real exported kernel symbol with nm -D + c++filt;
  • run msprof op simulator;
  • export the same MindStudio Insight simulator/ artifact shape.

Validation hardening

The implementation rejects invalid replay arg specs: negative indices, indices outside the 50-slot args array, and duplicate indices. It also packs float scalar values as IEEE 754 bits when pack_mode="bits" or dtype="FLOAT32_BITS" is used.
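The three index checks described above can be sketched as a single pass over the requested indices. This is an illustration under assumptions, not the PR's code: validate_arg_indices and is_valid are hypothetical names, and only the 50-slot limit comes from the description.

```python
from __future__ import annotations

from typing import Iterable

ARGS_SLOTS = 50  # size of the args array, as stated in the PR description


def validate_arg_indices(indices: Iterable[int], slots: int = ARGS_SLOTS) -> list[int]:
    """Reject negative, out-of-range, and duplicate indices; return them sorted."""
    seen: set[int] = set()
    for index in indices:
        if index < 0:
            raise ValueError(f"negative arg index: {index}")
        if index >= slots:
            raise ValueError(f"arg index {index} is outside the {slots}-slot args array")
        if index in seen:
            raise ValueError(f"duplicate arg index: {index}")
        seen.add(index)
    return sorted(seen)


def is_valid(indices: Iterable[int]) -> bool:
    """Convenience wrapper that turns the ValueError into a boolean."""
    try:
        validate_arg_indices(indices)
    except ValueError:
        return False
    return True


print(validate_arg_indices([4, 0, 1, 2, 3]))  # [0, 1, 2, 3, 4]
print(is_valid([-1]), is_valid([50]), is_valid([0, 0]))  # False False False
```

Running the checks before any C++ is rendered means a bad spec fails in Python with a readable message instead of producing host code that writes past the end of the array.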

Other hardening:

  • cache repeated kernel source reads;
  • demangle PTOAS symbols with one c++filt call;
  • fail clearly if PTOAS golden.py is missing or fails.

Tests

Added tests:

tests/ut/py/test_insight_trace_core.py

Validated with:

python -m pytest tests/ut/py/test_insight_trace_core.py -v
# 6 passed
python -m compileall -q simpler_setup/insight_trace \
  simpler_setup/tools/insight_trace.py tests/ut/py/test_insight_trace_core.py
# passed

Scope

Changed files:

simpler_setup/insight_trace/*
simpler_setup/tools/insight_trace.py
tests/ut/py/test_insight_trace_core.py

Overall diff:

12 files changed, 1302 insertions(+)

Output: outputs/insight_trace_*/insight_export/OPPROF_*/simulator/ (drag trace.json into MindStudio Insight).

gemini-code-assist (bot) left a comment

Code Review

This pull request introduces a toolset for generating MindStudio Insight trace data for incore kernels, featuring both 'simpler' and 'ptoas' backends. Key components include a CLI, argument resolution recipes for paged attention, kernel classification logic, and workspace generation via C++ templates. Feedback identifies a bug in scalar argument bit-packing and a potential out-of-bounds access in the generated host code. Additionally, improvements were suggested for optimizing file I/O and external process calls, along with the removal of incomplete code in the PTOAS backend.

Comment on lines +47 to +56
elif item["kind"] == "scalar":
    result.append(
        TraceScalarArg(
            index=int(item["index"]),
            name=item["name"],
            dtype=item["dtype"],
            value=item["value"],
            pack_mode=item.get("pack_mode", "value"),
        )
    )

high

When loading scalar arguments from a JSON specification, float values intended to be passed as bit patterns (e.g., for FLOAT32 kernel arguments) are currently truncated to integers by the template renderer. If pack_mode is set to "bits", the float value should be converted to its IEEE 754 bit representation here to ensure the generated C++ code receives the correct data.

Suggested change
- elif item["kind"] == "scalar":
-     result.append(
-         TraceScalarArg(
-             index=int(item["index"]),
-             name=item["name"],
-             dtype=item["dtype"],
-             value=item["value"],
-             pack_mode=item.get("pack_mode", "value"),
-         )
-     )
+ elif item["kind"] == "scalar":
+     val = item["value"]
+     pack_mode = item.get("pack_mode", "value")
+     if pack_mode == "bits" and isinstance(val, float):
+         val = _f32_bits(val)
+     result.append(
+         TraceScalarArg(
+             index=int(item["index"]),
+             name=item["name"],
+             dtype=item["dtype"],
+             value=val,
+             pack_mode=pack_mode,
+         )
+     )

Comment on lines +78 to +80
def render_host(config: TraceConfig) -> str:
    tensors = [arg for arg in config.args if isinstance(arg, TraceTensorArg)]
    scalars = [arg for arg in config.args if isinstance(arg, TraceScalarArg)]

high

The generated C++ host code uses a fixed-size array std::array<int64_t, kArgsSlots> args where kArgsSlots is 50. There is currently no validation to ensure that argument indices provided in the configuration do not exceed this limit, which could lead to out-of-bounds access in the generated code.

Suggested change
- def render_host(config: TraceConfig) -> str:
-     tensors = [arg for arg in config.args if isinstance(arg, TraceTensorArg)]
-     scalars = [arg for arg in config.args if isinstance(arg, TraceScalarArg)]
+ def render_host(config: TraceConfig) -> str:
+     for arg in config.args:
+         if arg.index >= 50:
+             raise ValueError(f"Argument index {arg.index} exceeds maximum supported slots (50)")
+     tensors = [arg for arg in config.args if isinstance(arg, TraceTensorArg)]
+     scalars = [arg for arg in config.args if isinstance(arg, TraceScalarArg)]

Comment on lines +9 to +16
from __future__ import annotations

import re
from pathlib import Path

from .models import KernelShape, KernelSpec, SceneCaseContext

_ARG_READ_RE = re.compile(r"args\[(\d+)\]")

medium

The kernel source file is read from disk multiple times across different functions (classify_kernel, read_arg_indices, and validate_single_task_kernel). Following the general rule to avoid redundant I/O, consider using functools.lru_cache on a helper function to read the file content once and reuse it.

from __future__ import annotations

import re
from functools import lru_cache
from pathlib import Path

from .models import KernelShape, KernelSpec, SceneCaseContext

_ARG_READ_RE = re.compile(r"args\[(\d+)\]")

@lru_cache(maxsize=8)
def _read_source(path: Path) -> str:
    return path.read_text()
References
  1. Avoid redundant I/O and parsing of large files by loading the data once and passing the parsed object to downstream functions.
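To make the effect of the cached reader concrete, the sketch below reads a temporary file twice and inspects the cache counters. It uses only the standard library; read_source mirrors the helper suggested in the review, while the file name and contents are made up for the demo. It also shows one trade-off worth knowing: a cached read does not notice later edits to the file until the cache is cleared.

```python
import tempfile
from functools import lru_cache
from pathlib import Path


@lru_cache(maxsize=8)
def read_source(path: Path) -> str:
    """Read a source file once per path; later calls are served from the cache."""
    return path.read_text()


with tempfile.TemporaryDirectory() as tmp:
    kernel = Path(tmp) / "kernel.cpp"  # made-up file for the demo
    kernel.write_text("int64_t a = args[0];\n")

    first = read_source(kernel)   # miss: reads from disk
    second = read_source(kernel)  # hit: no disk access
    info = read_source.cache_info()
    print(info.hits, info.misses)  # 1 1

    # The cache is keyed on the path only, so edits on disk are not seen.
    kernel.write_text("changed\n")
    stale = read_source(kernel)
    read_source.cache_clear()
    fresh = read_source(kernel)

print(stale == first, fresh == "changed\n")  # True True
```

For a one-shot CLI run the staleness is harmless, since the kernel source is not expected to change while the tool runs; a long-lived process would need to call cache_clear() or key the cache on the file's modification time.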

Comment on lines +130 to +138
candidates = []
for line in result.stdout.splitlines():
    parts = line.split()
    if len(parts) < 3 or parts[-2] not in {"T", "W"}:
        continue
    symbol = parts[-1]
    demangled = subprocess.run(["c++filt", symbol], check=False, capture_output=True, text=True).stdout.strip()
    if demangled.startswith(f"{kernel_base_name}("):
        candidates.append(symbol)

medium

Calling c++filt in a loop for every symbol found in the library is inefficient as it spawns a new process for each iteration. It is better to batch all symbols and pass them to a single c++filt process via standard input. Additionally, before using zip on the symbols and demangled names, verify their lengths are identical to prevent silent truncation.

    symbols = []
    for line in result.stdout.splitlines():
        parts = line.split()
        if len(parts) >= 3 and parts[-2] in {"T", "W"}:
            symbols.append(parts[-1])
    if not symbols:
        raise ValueError(f"No symbols found in {kernel_lib}")

    demangle_proc = subprocess.run(
        ["c++filt"], input="\n".join(symbols), capture_output=True, text=True, check=True
    )
    demangled_names = demangle_proc.stdout.splitlines()
    if len(symbols) != len(demangled_names):
        raise ValueError(f"Mismatched symbol count: {len(symbols)} symbols, {len(demangled_names)} demangled names")
    candidates = [
        sym for sym, demangled in zip(symbols, demangled_names)
        if demangled.startswith(f"{kernel_base_name}(")
    ]
References
  1. Before using zip on sequences that are expected to be of equal length, verify their lengths are identical and raise an error on mismatch to prevent silent truncation.

Comment on lines +150 to +156
main_cpp = case_dir / "main.cpp"
if not main_cpp.is_file():
    return
for line in main_cpp.read_text().splitlines():
    if ".bin" not in line or "fileSize_" not in line:
        continue


medium

This block of code appears to be incomplete or dead code. It iterates over lines in main.cpp but performs no action. If this logic is not yet implemented, it should be removed or replaced with a TODO.

Tighten generated argument handling and PTOAS symbol/input processing so invalid replay specs fail early and float bit scalars preserve their intended encoding.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>