Add: Insight Trace workspace generation for MindStudio profiling by vegetabledoww · Pull Request #821 · hw-native-sys/simpler

vegetabledoww · 2026-05-19T07:49:04Z

Implement Insight Trace feature to generate MindStudio Insight-compatible trace data (trace.json + visualize_data.bin) for incore kernel instruction-level profiling. Supports simpler and ptoas dual backends.

Summary

Overview

This PR adds a built-in Insight Trace workflow to simpler_setup. It lets users select a simpler incore kernel and generate profiling artifacts directly consumable by MindStudio Insight.

New command:

python -m simpler_setup.tools.insight_trace

Typical usage:

python -m simpler_setup.tools.insight_trace \
  examples/a2a3/tensormap_and_ringbuffer/paged_attention/test_paged_attention.py \
  --case CaseSmall1 \
  --kernel SF

MindStudio Insight output

The feature exports the simulator trace package:

<workspace>/insight_export/OPPROF_*/simulator/

MindStudio Insight opens:

<workspace>/insight_export/OPPROF_*/simulator/trace.json

trace.json is the UI entry point, but the whole simulator/ directory should be kept together as the unit to consume, including visualize_data.bin, core*.*/trace.json, and core*/*_instr_exe_*.csv.
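
A minimal sketch of how an exported simulator/ directory could be sanity-checked before opening it in MindStudio Insight. The helper name `check_simulator_dir` is hypothetical and not part of this PR; the file names come from the description above, and treating a missing per-core trace as a problem is an assumption of this sketch.

```python
from pathlib import Path


def check_simulator_dir(simulator_dir: Path) -> list[str]:
    """Return a list of problems found in an exported simulator/ directory.

    Hypothetical helper: an empty list means the files named in the PR
    description were found.
    """
    problems = []
    # Top-level files named in the PR description.
    for name in ("trace.json", "visualize_data.bin"):
        if not (simulator_dir / name).is_file():
            problems.append(f"missing {name}")
    # Per-core traces sit in core*.*/ subdirectories according to the text;
    # requiring at least one of them is an assumption.
    if not list(simulator_dir.glob("core*/trace.json")):
        problems.append("no per-core trace.json found")
    return problems
```

An empty return value suggests the directory has the expected shape; it says nothing about whether the trace content itself is valid.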

What changed

Added simpler_setup/insight_trace/, which supports:

loading SceneTestCase modules and cases;
selecting kernels by --kernel, --func-id, or --kernel-source;
classifying kernels as AIC-only, AIV-only, or SPMD mix;
resolving replay args from recipes or --arg-spec;
generating a standalone replay workspace;
running msprof op simulator collect/export;
validating exported MindStudio Insight artifacts.

Added CLI shim:

simpler_setup/tools/insight_trace.py

Generated simpler replay workspace:

replay_kernel.cpp
replay_launch.cpp
replay_host.cpp
CMakeLists.txt
run_collect.sh
insight_trace_config.json

The generated host runner allocates replay tensors, builds Tensor metadata, packs a 50-slot args array, launches replay_entry, and synchronizes.

Initial kernel support

Built-in recipes cover paged attention incore kernels:

QK, SF, PV, UP

For CaseSmall1 + SF:

args[0] sij   FLOAT32  [16, 16]
args[1] pij   BFLOAT16 [16, 16]
args[2] mij   FLOAT32  [16]
args[3] lij   FLOAT32  [16]
args[4] scale FLOAT32_BITS 1065353216
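
The scale value 1065353216 is the IEEE 754 single-precision bit pattern of 1.0 (0x3F800000). The sketch below shows that conversion with the standard `struct` module. The function names `f32_bits` and `bits_to_f32` are illustrative; the PR's own helper may be implemented differently.

```python
import struct


def f32_bits(value: float) -> int:
    """Reinterpret a float as its IEEE 754 single-precision bit pattern."""
    return struct.unpack("<I", struct.pack("<f", value))[0]


def bits_to_f32(bits: int) -> float:
    """Inverse of f32_bits: read a 32-bit pattern back as a float."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]


print(f32_bits(1.0))            # 1065353216
print(bits_to_f32(1065353216))  # 1.0
```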

PTOAS backend

Adds PTOAS backend plumbing for PTOAS-generated kernel C++ sources: call PTOAS generate_testcase.py, build the generated simulator runner, resolve the real exported kernel symbol with nm -D + c++filt, run msprof op simulator, and export the same MindStudio Insight simulator/ artifact shape.

Validation hardening

The implementation rejects invalid replay arg specs: negative indices, indices outside the 50-slot args array, and duplicate indices. It also packs float scalar values as IEEE 754 bits when pack_mode="bits" or dtype="FLOAT32_BITS" is used.
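
The three index checks described above can be sketched as follows. This is an illustration under assumptions, not the PR's code: the name `validate_arg_indices` and the error messages are hypothetical, and only the 50-slot limit is taken from the description.

```python
ARGS_SLOTS = 50  # slot count stated in the PR description


def validate_arg_indices(indices: list[int], slots: int = ARGS_SLOTS) -> None:
    """Raise ValueError for negative, out-of-range, or duplicate indices."""
    seen = set()
    for index in indices:
        if index < 0:
            raise ValueError(f"negative arg index: {index}")
        if index >= slots:
            raise ValueError(f"arg index {index} is outside the {slots}-slot args array")
        if index in seen:
            raise ValueError(f"duplicate arg index: {index}")
        seen.add(index)


validate_arg_indices([0, 1, 2, 3, 4])  # passes silently
```

Checking at spec-load time means a bad index fails in Python rather than as an out-of-bounds write in the generated host runner.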

Other hardening:

cache repeated kernel source reads;
demangle PTOAS symbols with one c++filt call;
fail clearly if PTOAS golden.py is missing or fails.

Tests

Added tests:

tests/ut/py/test_insight_trace_core.py

Validated with:

python -m pytest tests/ut/py/test_insight_trace_core.py -v
# 6 passed
python -m compileall -q simpler_setup/insight_trace \
  simpler_setup/tools/insight_trace.py tests/ut/py/test_insight_trace_core.py
# passed

Scope

Changed files:

simpler_setup/insight_trace/*
simpler_setup/tools/insight_trace.py
tests/ut/py/test_insight_trace_core.py

Overall diff:

12 files changed, 1302 insertions(+)

Output: outputs/insight_trace_*/insight_export/OPPROF_*/simulator/ (drag trace.json into MindStudio Insight).


gemini-code-assist

Code Review

This pull request introduces a toolset for generating MindStudio Insight trace data for incore kernels, featuring both 'simpler' and 'ptoas' backends. Key components include a CLI, argument resolution recipes for paged attention, kernel classification logic, and workspace generation via C++ templates. Feedback identifies a bug in scalar argument bit-packing and a potential out-of-bounds access in the generated host code. Additionally, improvements were suggested for optimizing file I/O and external process calls, along with the removal of incomplete code in the PTOAS backend.

gemini-code-assist · 2026-05-19T07:51:59Z

+        elif item["kind"] == "scalar":
+            result.append(
+                TraceScalarArg(
+                    index=int(item["index"]),
+                    name=item["name"],
+                    dtype=item["dtype"],
+                    value=item["value"],
+                    pack_mode=item.get("pack_mode", "value"),
+                )
+            )


When loading scalar arguments from a JSON specification, float values intended to be passed as bit patterns (e.g., for FLOAT32 kernel arguments) are currently truncated to integers by the template renderer. If pack_mode is set to "bits", the float value should be converted to its IEEE 754 bit representation here to ensure the generated C++ code receives the correct data.

Suggested change

-        elif item["kind"] == "scalar":
-            result.append(
-                TraceScalarArg(
-                    index=int(item["index"]),
-                    name=item["name"],
-                    dtype=item["dtype"],
-                    value=item["value"],
-                    pack_mode=item.get("pack_mode", "value"),
-                )
-            )
+        elif item["kind"] == "scalar":
+            val = item["value"]
+            pack_mode = item.get("pack_mode", "value")
+            if pack_mode == "bits" and isinstance(val, float):
+                val = _f32_bits(val)
+            result.append(
+                TraceScalarArg(
+                    index=int(item["index"]),
+                    name=item["name"],
+                    dtype=item["dtype"],
+                    value=val,
+                    pack_mode=pack_mode,
+                )
+            )

gemini-code-assist · 2026-05-19T07:51:59Z

+def render_host(config: TraceConfig) -> str:
+    tensors = [arg for arg in config.args if isinstance(arg, TraceTensorArg)]
+    scalars = [arg for arg in config.args if isinstance(arg, TraceScalarArg)]


The generated C++ host code uses a fixed-size array std::array<int64_t, kArgsSlots> args where kArgsSlots is 50. There is currently no validation to ensure that argument indices provided in the configuration do not exceed this limit, which could lead to out-of-bounds access in the generated code.

Suggested change

-def render_host(config: TraceConfig) -> str:
-    tensors = [arg for arg in config.args if isinstance(arg, TraceTensorArg)]
-    scalars = [arg for arg in config.args if isinstance(arg, TraceScalarArg)]
+def render_host(config: TraceConfig) -> str:
+    for arg in config.args:
+        if arg.index >= 50:
+            raise ValueError(f"Argument index {arg.index} exceeds maximum supported slots (50)")
+    tensors = [arg for arg in config.args if isinstance(arg, TraceTensorArg)]
+    scalars = [arg for arg in config.args if isinstance(arg, TraceScalarArg)]

gemini-code-assist · 2026-05-19T07:52:00Z

+from __future__ import annotations
+
+import re
+from pathlib import Path
+
+from .models import KernelShape, KernelSpec, SceneCaseContext
+
+_ARG_READ_RE = re.compile(r"args\[(\d+)\]")


The kernel source file is read from disk multiple times across different functions (classify_kernel, read_arg_indices, and validate_single_task_kernel). Following the general rule to avoid redundant I/O, consider using functools.lru_cache on a helper function to read the file content once and reuse it.

from __future__ import annotations

import re
from functools import lru_cache
from pathlib import Path

from .models import KernelShape, KernelSpec, SceneCaseContext

_ARG_READ_RE = re.compile(r"args\[(\d+)\]")


@lru_cache(maxsize=8)
def _read_source(path: Path) -> str:
    return path.read_text()

References

Avoid redundant I/O and parsing of large files by loading the data once and passing the parsed object to downstream functions.
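
A small self-contained demonstration of the caching behavior this suggestion relies on. It mirrors the reviewer's `_read_source` helper under the name `read_source`; the temporary file and its contents are made up for illustration.

```python
import tempfile
from functools import lru_cache
from pathlib import Path


@lru_cache(maxsize=8)
def read_source(path: Path) -> str:
    """Read a file once per distinct path; later calls hit the cache."""
    return path.read_text()


with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "kernel.cpp"
    src.write_text("args[0]; args[3];")
    first = read_source(src)
    src.write_text("changed on disk")
    second = read_source(src)  # served from the cache, file is not reread

print(first == second)                # True
print(read_source.cache_info().hits)  # 1
```

The second read returns the original text even though the file changed, so this pattern fits a short-lived CLI run where sources are not edited mid-process. A long-running process would need `read_source.cache_clear()` or a different strategy.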

gemini-code-assist · 2026-05-19T07:52:00Z

+    candidates = []
+    for line in result.stdout.splitlines():
+        parts = line.split()
+        if len(parts) < 3 or parts[-2] not in {"T", "W"}:
+            continue
+        symbol = parts[-1]
+        demangled = subprocess.run(["c++filt", symbol], check=False, capture_output=True, text=True).stdout.strip()
+        if demangled.startswith(f"{kernel_base_name}("):
+            candidates.append(symbol)


Calling c++filt in a loop for every symbol found in the library is inefficient as it spawns a new process for each iteration. It is better to batch all symbols and pass them to a single c++filt process via standard input. Additionally, before using zip on the symbols and demangled names, verify their lengths are identical to prevent silent truncation.

    symbols = []
    for line in result.stdout.splitlines():
        parts = line.split()
        if len(parts) >= 3 and parts[-2] in {"T", "W"}:
            symbols.append(parts[-1])

    if not symbols:
        raise ValueError(f"No symbols found in {kernel_lib}")

    demangle_proc = subprocess.run(
        ["c++filt"], input="\n".join(symbols), capture_output=True, text=True, check=True
    )
    demangled_names = demangle_proc.stdout.splitlines()

    if len(symbols) != len(demangled_names):
        raise ValueError(f"Mismatched symbol count: {len(symbols)} symbols, {len(demangled_names)} demangled names")

    candidates = [
        sym for sym, demangled in zip(symbols, demangled_names)
        if demangled.startswith(f"{kernel_base_name}(")
    ]

References

Before using zip on sequences that are expected to be of equal length, verify their lengths are identical and raise an error on mismatch to prevent silent truncation.
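
The parsing and matching steps can be separated from the subprocess calls, which makes them testable without `nm` or `c++filt` installed. The sketch below assumes that split; the function names, the sample `nm -D` output, and the hand-written demangled names are all illustrative stand-ins rather than output captured from a real PTOAS build.

```python
def exported_symbols(nm_output: str) -> list[str]:
    """Collect names of type T (text) or W (weak) symbols from `nm -D` output."""
    symbols = []
    for line in nm_output.splitlines():
        parts = line.split()
        # Defined symbols have three columns: address, type, name.
        if len(parts) >= 3 and parts[-2] in {"T", "W"}:
            symbols.append(parts[-1])
    return symbols


def match_kernel_symbols(symbols, demangled_names, kernel_base_name):
    """Return mangled symbols whose demangled name is a call to the kernel."""
    if len(symbols) != len(demangled_names):
        raise ValueError(
            f"Mismatched symbol count: {len(symbols)} symbols, "
            f"{len(demangled_names)} demangled names"
        )
    return [
        sym
        for sym, name in zip(symbols, demangled_names)
        if name.startswith(f"{kernel_base_name}(")
    ]


# Made-up sample in the usual `nm -D` layout; the undefined symbol is skipped.
NM_SAMPLE = """\
                 U malloc
0000000000001139 T _Z9sf_kernelPl
0000000000001200 W _Z6helperv
"""
symbols = exported_symbols(NM_SAMPLE)
# Stand-ins for what a single batched c++filt call would return, one per line.
demangled = ["sf_kernel(long*)", "helper()"]
print(match_kernel_symbols(symbols, demangled, "sf_kernel"))
```

In real use, `demangled` would come from one `c++filt` invocation fed all symbols on standard input, as in the reviewer's suggestion.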

gemini-code-assist · 2026-05-19T07:52:00Z

+    main_cpp = case_dir / "main.cpp"
+    if not main_cpp.is_file():
+        return
+    for line in main_cpp.read_text().splitlines():
+        if ".bin" not in line or "fileSize_" not in line:
+            continue
+


This block of code appears to be incomplete or dead code. It iterates over lines in main.cpp but performs no action. If this logic is not yet implemented, it should be removed or replaced with a TODO.


Add: Insight Trace workspace generation for MindStudio profiling

8c85495

Implement Insight Trace feature to generate MindStudio Insight-compatible trace data (trace.json + visualize_data.bin) for incore kernel instruction-level profiling. Supports simpler and ptoas dual backends.

gemini-code-assist Bot reviewed May 19, 2026


Fix insight trace review findings

e3e877d

Tighten generated argument handling and PTOAS symbol/input processing so invalid replay specs fail early and float bit scalars preserve their intended encoding. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

vegetabledoww wants to merge 2 commits into hw-native-sys:main from vegetabledoww:2b
