Add runtime args records to tensor dump #801
Reuse the existing tensor dump channel to capture task runtime arguments from AICPU execution and export them on the host side. Add ARGS dump records, payload metadata for tensor and scalar arguments, collector-side JSON export, and viewer support for listing dumped args. Wire args dumping into both tensormap/ringbuffer and host-build-graph runtimes for A2A3 and A5. Keep tensor dump records backward-compatible by reusing the existing alignment byte as a record kind discriminator. Fix the A5 collector integration with the profiler refactor by restoring ProfilerBase inheritance, removing stale polling helpers, and supporting args-only JSON export.
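As a rough illustration of the backward-compatibility idea described above, the sketch below reinterprets a former padding byte as a record kind. The header layout, field names, and enum values are all hypothetical and are not taken from this PR; the sketch also assumes that older producers zero-filled the alignment byte, so existing tensor records decode as kind 0.

```cpp
#include <cstdint>

// Hypothetical record kinds; 0 is assumed to match the old zero-filled padding.
enum class DumpRecordKind : std::uint8_t { Tensor = 0, Args = 1 };

// Hypothetical header layout, for illustration only.
struct DumpRecordHeaderSketch {
    std::uint32_t payload_bytes;
    std::uint8_t dtype;
    std::uint8_t ndims;
    std::uint8_t flags;
    std::uint8_t kind;  // formerly an alignment byte, now the discriminator
};

// Reusing the padding byte leaves the header size unchanged.
static_assert(sizeof(DumpRecordHeaderSketch) == 8, "header size must not change");

// Classify a record; anything other than the Args value is read as a tensor record.
inline DumpRecordKind record_kind(const DumpRecordHeaderSketch& header) {
    return header.kind == static_cast<std::uint8_t>(DumpRecordKind::Args)
               ? DumpRecordKind::Args
               : DumpRecordKind::Tensor;
}
```

Because the struct size and the offsets of the other fields stay the same, a reader that ignores the byte keeps working, while an updated reader can branch on it.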
Code Review
This pull request introduces the capability to capture and dump task argument descriptors, including tensor buffer descriptors and scalar values, alongside existing tensor dumps. The changes span the AICPU runtime, host-side collectors, and documentation, adding a new record type to the dump channel and updating the JSON manifest format to include an 'args' array. Feedback from the review highlights the need to ensure 64-byte alignment for the ArgsDumpTensorEntry struct to optimize cache performance. Additionally, it is recommended to serialize all 64-bit unsigned integers (such as buffer addresses, task IDs, and scalar values) as strings in the JSON output to prevent precision loss in JavaScript-based parsers.

    struct ArgsDumpTensorEntry {
        uint64_t buffer_addr;
        uint64_t buffer_size;
        uint64_t owner_task_id;
        uint32_t shapes[PLATFORM_DUMP_MAX_DIMS];
        uint32_t raw_shapes[PLATFORM_DUMP_MAX_DIMS];
        uint32_t offsets[PLATFORM_DUMP_MAX_DIMS];
        uint32_t ndims;
        uint8_t dtype;
        uint8_t is_contiguous;
        uint8_t is_all_offset_zero;
        uint8_t reserved;
    };

The ArgsDumpTensorEntry struct should be 64-byte aligned to ensure optimal cache performance and prevent regressions. Please adjust the layout or add padding to make the struct size a multiple of 64, and include a static_assert to verify the alignment of critical members.
References
- Ensure critical struct layout alignments (especially for cache performance) are 64-byte aligned and protected by static_assert to prevent regressions.
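One way the suggestion above could be applied is sketched here. It assumes a value of 5 for `PLATFORM_DUMP_MAX_DIMS` (the real constant is defined elsewhere and may differ) and uses a renamed struct so it is not mistaken for the code in this PR. `alignas(64)` makes the compiler round the size up to a multiple of 64, and the `static_assert` lines guard the layout against later regressions.

```cpp
#include <cstddef>
#include <cstdint>

// Assumed value for illustration; the real PLATFORM_DUMP_MAX_DIMS may differ.
constexpr std::size_t kAssumedMaxDims = 5;

// Same fields as ArgsDumpTensorEntry, with 64-byte alignment requested.
struct alignas(64) AlignedArgsDumpTensorEntry {
    std::uint64_t buffer_addr;
    std::uint64_t buffer_size;
    std::uint64_t owner_task_id;
    std::uint32_t shapes[kAssumedMaxDims];
    std::uint32_t raw_shapes[kAssumedMaxDims];
    std::uint32_t offsets[kAssumedMaxDims];
    std::uint32_t ndims;
    std::uint8_t dtype;
    std::uint8_t is_contiguous;
    std::uint8_t is_all_offset_zero;
    std::uint8_t reserved;
};

// Layout guards: these fail at compile time if the struct drifts.
static_assert(alignof(AlignedArgsDumpTensorEntry) == 64,
              "entry must be 64-byte aligned");
static_assert(sizeof(AlignedArgsDumpTensorEntry) % 64 == 0,
              "entry size must be a multiple of 64 bytes");
static_assert(offsetof(AlignedArgsDumpTensorEntry, buffer_addr) == 0,
              "buffer_addr must be the first member");
```

Note that `alignas` adds implicit tail padding, which changes the stride of an array of entries. If the host and device must agree on an exact byte layout, explicit padding fields may be the safer choice.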

    json << "{\"arg_index\": " << t << ", \"buffer_addr\": \"0x" << std::hex << entry.buffer_addr << std::dec
         << "\", \"buffer_size\": " << entry.buffer_size << ", \"owner_task_id\": \"0x" << std::hex
         << entry.owner_task_id << std::dec << "\", \"dtype\": \""
         << get_dtype_name_from_raw(entry.dtype) << "\", \"shape\": "

The buffer_addr and owner_task_id fields are 64-bit values. When serializing these to JSON, they must be represented as strings (e.g., using std::setfill('0') << std::setw(16)) to prevent precision loss in JavaScript-based parsers which are limited to 2^53 - 1.
References
- When serializing 64-bit unsigned integers to JSON, represent them as strings to prevent precision loss in JavaScript-based parsers.
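A minimal sketch of the fixed-width formatting the comment mentions is shown below. The helper name `to_hex64_json_string` is hypothetical and not part of this PR; it only uses standard `<iomanip>` manipulators.

```cpp
#include <cstdint>
#include <iomanip>
#include <sstream>
#include <string>

// Hypothetical helper: renders a 64-bit value as a quoted JSON string such as
// "0x0000000000001234", zero-padded to 16 hex digits.
std::string to_hex64_json_string(std::uint64_t value) {
    std::ostringstream out;
    out << "\"0x" << std::hex << std::setfill('0') << std::setw(16) << value << "\"";
    return out.str();
}
```

Writing into a local `std::ostringstream` keeps the `std::hex` and fill state from leaking into the caller's stream, so no matching `std::dec` is needed at the call site.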

            break;
        }
        if (s > 0) json << ", ";
        json << "\"0x" << std::hex << value << std::dec << "\"";

This 64-bit scalar value must be serialized as a string in the JSON output to avoid precision loss in external parsers. Using a consistent hex format with padding is recommended.
References
- When serializing 64-bit unsigned integers to JSON, represent them as strings to prevent precision loss in JavaScript-based parsers.
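The precision concern can be checked with a small sketch. It assumes the external parser stores JSON numbers as IEEE 754 doubles, as JavaScript does; the function name is hypothetical.

```cpp
#include <cstdint>

// Returns true if `value` is unchanged after being stored in a double,
// which is how a JavaScript-style parser would hold a bare JSON number.
bool survives_double_round_trip(std::uint64_t value) {
    const double as_double = static_cast<double>(value);
    // 2^64 does not fit in uint64_t; converting it back would be undefined.
    if (as_double >= 18446744073709551616.0) return false;
    return static_cast<std::uint64_t>(as_double) == value;
}
```

Values up to 2^53 round-trip exactly, while 2^53 + 1 does not, which is why 64-bit addresses, task IDs, and scalars are safer as strings.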
## Summary

Add Dump Args support by reusing the existing tensor dump channel.

This feature records task runtime args observed on the device side and exports them to the host-side `tensor_dump.json`. The exported args records can be correlated with tensor dump, swimlane, PMU, and task IDs for DFX analysis.

## Changes

- Reuse the existing `--dump-tensor` channel for Dump Args.
- Extend tensor dump record handling.
- Export args records to `tensor_dump.json`.
  - `total_args`
  - `args`
- Record device-side task args at dispatch time.
  - `task_id`
  - `subtask_id`
  - `func_id`
  - `stage`
- Extend `dump_viewer` to list args records.
  - `python -m simpler_setup.tools.dump_viewer --args`
- Wire Dump Args into both runtime paths.
  - `tensormap_and_ringbuffer`
  - `host_build_graph`
- Fix PTO2 AIV-only task args metadata.
  - `kernel_id[0]` as the args record `func_id`.
  - `kernel_id[0]` is `INVALID_KERNEL_ID=-1`, which is exported as `4294967295` in JSON.
  - `subtask_id`.

## Example Output

`tensor_dump.json` now includes an `args` array, for example:

    {
      "total_args": 2,
      "args": [
        {
          "task_id": "0x0000000000000000",
          "subtask_id": 0,
          "func_id": 0,
          "stage": "before_dispatch",
          "tensor_count": 3,
          "scalar_count": 0,
          "tensors": [
            {
              "arg_index": 0,
              "buffer_addr": "0x...",
              "buffer_size": 65536,
              "owner_task_id": "0x...",
              "dtype": "FLOAT32",
              "shape": [16384],
              "raw_shape": [16384],
              "offsets": [0],
              "is_contiguous": true,
              "is_all_offset_zero": true
            }
          ],
          "scalars": []
        }
      ]
    }

## Validation

### Static Check

- `git diff --check`
- Result: passed

### a2a3sim PTO2 Tensor Dump Smoke

- Test:
  - `tests/st/a2a3/tensormap_and_ringbuffer/dfx/tensor_dump/test_tensor_dump.py`
- Flags:
  - `--platform a2a3sim`
  - `--dump-tensor`
  - `--build`
- Result:
  - passed
- Manifest check:
  - `total_args=5`
  - `func_ids=[0, 1, 2]`
  - `subtask_ids=[1]`

### a5sim PTO2 Tensor Dump Smoke

- Test:
  - `examples/a5/tensormap_and_ringbuffer/vector_example`
- Flags:
  - `--platform a5sim`
  - `--dump-tensor`
  - `--build`
- Result:
  - passed
- Manifest check:
  - `total_args=5`
  - `func_ids=[0, 1, 2]`
  - `subtask_ids=[1]`

### a2a3 Hardware PTO2 Tensor Dump Smoke

- Test:
  - `tests/st/a2a3/tensormap_and_ringbuffer/dfx/tensor_dump/test_tensor_dump.py`
- Submitted through the shared NPU queue:
  - `task-submit --device auto`
- Result:
  - passed
- Manifest check:
  - `total_args=5`
  - `func_ids=[0, 1, 2]`
  - `subtask_ids=[1]`

### a2a3 Hardware host_build_graph Dump Tensor Example

- Test:
  - `tests/st/a2a3/host_build_graph/dump_tensor/test_dump_tensor_example.py`
- Submitted through the shared NPU queue:
  - `task-submit --device auto`
- Result:
  - passed
- Manifest check:
  - `total_tensors=5`
  - `total_args=2`
  - `func_ids=[0, 1]`
  - `subtask_ids=[0]`

## Notes

- Dump Args intentionally reuses the tensor dump lifecycle and output directory structure.
- Args are currently captured at the `before_dispatch` stage.
- If L0 swimlane later requires independent args enable/disable or trigger behavior, a separate channel can be considered then.
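
The AIV-only fix listed under Changes mentions that `INVALID_KERNEL_ID=-1` is exported as `4294967295`. The sketch below shows why, assuming the ID is a signed 32-bit value that ends up in an unsigned 32-bit `func_id` field; the constant and function names are hypothetical.

```cpp
#include <cstdint>

// Assumed to mirror INVALID_KERNEL_ID=-1 from the PR description.
constexpr std::int32_t kInvalidKernelId = -1;

// Converting a negative 32-bit value to uint32_t wraps modulo 2^32.
constexpr std::uint32_t as_exported_func_id(std::int32_t kernel_id) {
    return static_cast<std::uint32_t>(kernel_id);
}

static_assert(as_exported_func_id(kInvalidKernelId) == 4294967295u,
              "-1 wraps to UINT32_MAX when stored as an unsigned func_id");
```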