
Add runtime args records to tensor dump #801

Open
zmnobug wants to merge 1 commit into hw-native-sys:main from zmnobug:dump_args

@zmnobug zmnobug commented May 18, 2026

Summary

Add Dump Args support by reusing the existing tensor dump channel.

This feature records task runtime args observed on the device side and exports them to host-side
tensor_dump.json. The exported args records can be correlated with tensor dump, swimlane, PMU, and task ids
for DFX analysis.

Changes

  • Reuse the existing --dump-tensor channel for Dump Args.

    • No new standalone switch.
    • No new shared-memory channel.
    • No new output directory.
  • Extend tensor dump record handling.

    • Support both normal tensor dump records and args dump records.
    • Args payloads reuse the existing dump arena, metadata buffer, ready queue, and host collector flow.
  • Export args records to tensor_dump.json.

    • New top-level field: total_args
    • New top-level array: args
  • Record device-side task args at dispatch time.

    • task_id
    • subtask_id
    • func_id
    • stage
    • tensor/scalar arg counts
    • tensor arg metadata, including address, size, dtype, shape, raw shape, and offsets
    • scalar arg values
  • Extend dump_viewer to list args records.

    • Usage: python -m simpler_setup.tools.dump_viewer --args
  • Wire Dump Args into both runtime paths.

    • tensormap_and_ringbuffer
    • host_build_graph
  • Fix PTO2 AIV-only task args metadata.

    • The previous implementation always used kernel_id[0] as the args record func_id.
    • For AIV-only tasks, kernel_id[0] is INVALID_KERNEL_ID (-1), which wraps to 4294967295 when exported as an
      unsigned 32-bit value in JSON.
    • The implementation now selects the first active subtask slot with a valid kernel id and records the matching
      subtask_id.
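
The slot selection described in the last item can be modeled in a few lines of Python. This is only a sketch of the described behavior, not the PR's C++ code: the `active_mask` encoding, the three-slot layout, and the helper names are assumptions made for illustration.

```python
INVALID_KERNEL_ID = -1  # sentinel named in the PR description


def as_uint32(value: int) -> int:
    """Reinterpret a signed 32-bit value as unsigned, as a uint32_t field would."""
    return value & 0xFFFFFFFF


def select_args_func_id(kernel_ids, active_mask):
    """Return (subtask_id, func_id) for the first active slot with a valid kernel id.

    kernel_ids: per-slot kernel ids (hypothetical three-slot layout).
    active_mask: bit i set means slot i is active (hypothetical encoding).
    """
    for slot, kernel_id in enumerate(kernel_ids):
        if (active_mask >> slot) & 1 and kernel_id != INVALID_KERNEL_ID:
            return slot, kernel_id
    return None  # no active slot carries a valid kernel id


# AIV-only task: slot 0 holds the sentinel, slot 1 holds a real kernel.
aiv_only = [INVALID_KERNEL_ID, 2, INVALID_KERNEL_ID]

old_func_id = as_uint32(aiv_only[0])  # what always reading kernel_id[0] produced
new_choice = select_args_func_id(aiv_only, active_mask=0b010)

print(old_func_id)  # 4294967295
print(new_choice)   # (1, 2)
```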

Example Output

tensor_dump.json now includes an args array. An abbreviated example:

{
  "total_args": 2,
  "args": [
    {
      "task_id": "0x0000000000000000",
      "subtask_id": 0,
      "func_id": 0,
      "stage": "before_dispatch",
      "tensor_count": 3,
      "scalar_count": 0,
      "tensors": [
        {
          "arg_index": 0,
          "buffer_addr": "0x...",
          "buffer_size": 65536,
          "owner_task_id": "0x...",
          "dtype": "FLOAT32",
          "shape": [16384],
          "raw_shape": [16384],
          "offsets": [0],
          "is_contiguous": true,
          "is_all_offset_zero": true
        }
      ],
      "scalars": []
    }
  ]
}
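
A consumer could read these records with the standard `json` module, as in the sketch below. It assumes only the field names shown in the example; the sample manifest is made up (including the placeholder buffer address), and `summarize_args` is a hypothetical helper, not part of dump_viewer.

```python
import json

# Made-up manifest in the shape shown above; the buffer address is a placeholder.
SAMPLE_MANIFEST = """
{
  "total_args": 1,
  "args": [
    {
      "task_id": "0x0000000000000000",
      "subtask_id": 0,
      "func_id": 0,
      "stage": "before_dispatch",
      "tensor_count": 1,
      "scalar_count": 0,
      "tensors": [
        {
          "arg_index": 0,
          "buffer_addr": "0x0000000012340000",
          "buffer_size": 65536,
          "owner_task_id": "0x0000000000000000",
          "dtype": "FLOAT32",
          "shape": [16384],
          "raw_shape": [16384],
          "offsets": [0],
          "is_contiguous": true,
          "is_all_offset_zero": true
        }
      ],
      "scalars": []
    }
  ]
}
"""


def summarize_args(manifest):
    """Return one summary line per args record."""
    lines = []
    for record in manifest.get("args", []):
        task_id = int(record["task_id"], 16)  # hex string -> int
        tensors = ", ".join(
            f"arg{t['arg_index']}:{t['dtype']}{t['shape']}@{t['buffer_addr']}"
            for t in record.get("tensors", [])
        )
        lines.append(
            f"task={task_id:#x} subtask={record['subtask_id']} "
            f"func={record['func_id']} stage={record['stage']} [{tensors}]"
        )
    return lines


summary = summarize_args(json.loads(SAMPLE_MANIFEST))
for line in summary:
    print(line)
```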

## Validation

### Static Check

- git diff --check
    - Result: passed

### a2a3sim PTO2 Tensor Dump Smoke

- Test:
    - tests/st/a2a3/tensormap_and_ringbuffer/dfx/tensor_dump/test_tensor_dump.py
- Flags:
    - --platform a2a3sim
    - --dump-tensor
    - --build
- Result:
    - passed
- Manifest check:
    - total_args=5
    - func_ids=[0, 1, 2]
    - subtask_ids=[1]

### a5sim PTO2 Tensor Dump Smoke

- Test:
    - examples/a5/tensormap_and_ringbuffer/vector_example
- Flags:
    - --platform a5sim
    - --dump-tensor
    - --build
- Result:
    - passed
- Manifest check:
    - total_args=5
    - func_ids=[0, 1, 2]
    - subtask_ids=[1]

### a2a3 Hardware PTO2 Tensor Dump Smoke

- Test:
    - tests/st/a2a3/tensormap_and_ringbuffer/dfx/tensor_dump/test_tensor_dump.py
- Submitted through the shared NPU queue:
    - task-submit --device auto
- Result:
    - passed
- Manifest check:
    - total_args=5
    - func_ids=[0, 1, 2]
    - subtask_ids=[1]

### a2a3 Hardware host_build_graph Dump Tensor Example

- Test:
    - tests/st/a2a3/host_build_graph/dump_tensor/test_dump_tensor_example.py
- Submitted through the shared NPU queue:
    - task-submit --device auto
- Result:
    - passed
- Manifest check:
    - total_tensors=5
    - total_args=2
    - func_ids=[0, 1]
    - subtask_ids=[0]
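
The manifest checks reported in the sections above (total_args, func_ids, subtask_ids) could be reproduced with a short script along these lines. This is a sketch under assumptions: the PR does not say how the checks were run, `manifest_check` is a hypothetical helper, and the sample records are made up rather than taken from a real run.

```python
def manifest_check(manifest):
    """Summarize the args section of a parsed tensor_dump.json manifest."""
    args = manifest.get("args", [])
    return {
        "total_args": manifest.get("total_args"),
        # The declared count should agree with the number of listed records.
        "count_matches": manifest.get("total_args") == len(args),
        "func_ids": sorted({record["func_id"] for record in args}),
        "subtask_ids": sorted({record["subtask_id"] for record in args}),
    }


# Made-up records; a real run would load the manifest with json.load().
sample = {
    "total_args": 3,
    "args": [
        {"func_id": 1, "subtask_id": 1},
        {"func_id": 0, "subtask_id": 1},
        {"func_id": 1, "subtask_id": 1},
    ],
}

result = manifest_check(sample)
print(result)
```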

## Notes

- Dump Args intentionally reuses the tensor dump lifecycle and output directory structure.
- Args are currently captured at the before_dispatch stage.
- If L0 swimlane later requires independent args enable/disable or trigger behavior, a separate channel can be
  considered then.

Reuse the existing tensor dump channel to capture task runtime arguments from
AICPU execution and export them on the host side. Add ARGS dump records, payload
metadata for tensor and scalar arguments, collector-side JSON export, and viewer
support for listing dumped args.

Wire args dumping into both tensormap/ringbuffer and host-build-graph runtimes
for A2A3 and A5. Keep tensor dump records backward-compatible by reusing the
existing alignment byte as a record kind discriminator.
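
To illustrate why reusing a padding byte can stay backward-compatible, here is a sketch with a hypothetical 16-byte record header. The field layout, the kind values, and the assumption that legacy writers zero-filled the padding are all illustrative; the actual header in the PR may differ.

```python
import struct

# Hypothetical legacy layout: task_id (u64), payload_size (u32), dtype (u8),
# followed by 3 zero-filled padding bytes.
LEGACY_HEADER = struct.Struct("<QIB3x")
# Hypothetical new layout: the first padding byte now carries the record kind.
HEADER = struct.Struct("<QIBB2x")

KIND_TENSOR = 0  # assumed to be zero, so zeroed legacy padding reads as "tensor"
KIND_ARGS = 1    # assumed value for args records


def record_kind(raw: bytes) -> int:
    """Decode the kind byte from a 16-byte header."""
    _task_id, _payload_size, _dtype, kind = HEADER.unpack(raw)
    return kind


legacy_record = LEGACY_HEADER.pack(7, 64, 1)     # written by an old producer
args_record = HEADER.pack(7, 128, 0, KIND_ARGS)  # written by a new producer

print(record_kind(legacy_record))  # 0
print(record_kind(args_record))    # 1
```

Because the header size does not change, a reader that knows only the legacy layout still finds every field at the same offset.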

Fix the A5 collector integration with the profiler refactor by restoring
ProfilerBase inheritance, removing stale polling helpers, and supporting
args-only JSON export.

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces the capability to capture and dump task argument descriptors, including tensor buffer descriptors and scalar values, alongside existing tensor dumps. The changes span the AICPU runtime, host-side collectors, and documentation, adding a new record type to the dump channel and updating the JSON manifest format to include an 'args' array. Feedback from the review highlights the need to ensure 64-byte alignment for the ArgsDumpTensorEntry struct to optimize cache performance. Additionally, it is recommended to serialize all 64-bit unsigned integers (such as buffer addresses, task IDs, and scalar values) as strings in the JSON output to prevent precision loss in JavaScript-based parsers.

Comment on lines +97 to +109
struct ArgsDumpTensorEntry {
    uint64_t buffer_addr;
    uint64_t buffer_size;
    uint64_t owner_task_id;
    uint32_t shapes[PLATFORM_DUMP_MAX_DIMS];
    uint32_t raw_shapes[PLATFORM_DUMP_MAX_DIMS];
    uint32_t offsets[PLATFORM_DUMP_MAX_DIMS];
    uint32_t ndims;
    uint8_t dtype;
    uint8_t is_contiguous;
    uint8_t is_all_offset_zero;
    uint8_t reserved;
};

high

The ArgsDumpTensorEntry struct should be 64-byte aligned to ensure optimal cache performance and prevent regressions. Please adjust the layout or add padding to make the struct size a multiple of 64, and include a static_assert to verify the alignment of critical members.

References
  1. Ensure critical struct layout alignments (especially for cache performance) are 64-byte aligned and protected by static_assert to prevent regressions.
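
One way to estimate how much padding this suggestion would require is to mirror the struct with Python's `ctypes`, which applies the host platform's native alignment rules. This is a sketch only: the value of `PLATFORM_DUMP_MAX_DIMS` is not shown in this thread, so `MAX_DIMS = 5` below is an assumption, and the device compiler's layout may differ from the host's.

```python
import ctypes

MAX_DIMS = 5  # assumed stand-in for PLATFORM_DUMP_MAX_DIMS


class ArgsDumpTensorEntry(ctypes.Structure):
    """Host-side mirror of the struct quoted in the review comment."""

    _fields_ = [
        ("buffer_addr", ctypes.c_uint64),
        ("buffer_size", ctypes.c_uint64),
        ("owner_task_id", ctypes.c_uint64),
        ("shapes", ctypes.c_uint32 * MAX_DIMS),
        ("raw_shapes", ctypes.c_uint32 * MAX_DIMS),
        ("offsets", ctypes.c_uint32 * MAX_DIMS),
        ("ndims", ctypes.c_uint32),
        ("dtype", ctypes.c_uint8),
        ("is_contiguous", ctypes.c_uint8),
        ("is_all_offset_zero", ctypes.c_uint8),
        ("reserved", ctypes.c_uint8),
    ]


def padding_to_multiple(size: int, multiple: int = 64) -> int:
    """Bytes to append so that size becomes a multiple of `multiple`."""
    return (-size) % multiple


size = ctypes.sizeof(ArgsDumpTensorEntry)
pad = padding_to_multiple(size)
print(f"sizeof={size} padding_needed={pad} padded={size + pad}")
```

In the C++ source, the equivalent guard would be a static_assert on the struct size, as the reviewer suggests.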

Comment on lines +647 to +650
json << "{\"arg_index\": " << t << ", \"buffer_addr\": \"0x" << std::hex << entry.buffer_addr << std::dec
     << "\", \"buffer_size\": " << entry.buffer_size << ", \"owner_task_id\": \"0x" << std::hex
     << entry.owner_task_id << std::dec << "\", \"dtype\": \""
     << get_dtype_name_from_raw(entry.dtype) << "\", \"shape\": "

medium

The buffer_addr and owner_task_id fields are 64-bit values. When serializing these to JSON, they must be represented as strings, ideally zero-padded to a fixed width (e.g., using std::setfill('0') << std::setw(16)), to prevent precision loss in JavaScript-based parsers, which represent integers exactly only up to 2^53 - 1.

References
  1. When serializing 64-bit unsigned integers to JSON, represent them as strings to prevent precision loss in JavaScript-based parsers.
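
The precision concern can be demonstrated with a short sketch. A JavaScript Number is an IEEE 754 double, which Python's `float` also is, so converting through `float` shows the same loss. The helper names `to_hex64` and `from_hex64` are hypothetical and only illustrate the fixed-width hex string format.

```python
import json

value = 2**53 + 1  # smallest positive integer a double cannot represent exactly

as_double = float(value)             # what a JavaScript Number would hold
lossy = int(as_double) != value      # True: the low bit is rounded away


def to_hex64(v: int) -> str:
    """Format a 64-bit unsigned value as a zero-padded hex string."""
    return f"0x{v:016x}"


def from_hex64(s: str) -> int:
    """Parse a hex string produced by to_hex64 back into an int."""
    return int(s, 16)


# Round-trip through JSON as a string: no precision is lost.
encoded = json.dumps({"buffer_addr": to_hex64(value)})
decoded = from_hex64(json.loads(encoded)["buffer_addr"])

print(lossy)             # True
print(encoded)           # {"buffer_addr": "0x0020000000000001"}
print(decoded == value)  # True
```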

break;
}
if (s > 0) json << ", ";
json << "\"0x" << std::hex << value << std::dec << "\"";

medium

This 64-bit scalar value must be serialized as a string in the JSON output to avoid precision loss in external parsers. Using a consistent hex format with padding is recommended.

References
  1. When serializing 64-bit unsigned integers to JSON, represent them as strings to prevent precision loss in JavaScript-based parsers.
