
fix(cann): prevent race condition on .om cache file during multi-card compilation#28533

Open
0AnshuAditya0 wants to merge 2 commits into microsoft:main from 0AnshuAditya0:fix/cann-om-race-condition

@0AnshuAditya0

Description

Fixes #26778

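When multi-card inference is launched via torchrun --nproc_per_node=8 with
dump_om_model enabled, the processes race on the shared .om cache file. This
is a TOCTOU (time-of-check to time-of-use) race: one process can find and load
the file while another is still writing it, so aclmdlLoadFromFile() either
fails or loads a model that produces accuracy errors.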

Fix:

  • BuildONNXModel (cann_graph.cc): write to a temporary file
    (file_name + "_tmp") first, then call std::filesystem::rename() once
    the write is complete. On POSIX systems this behaves as rename(2), which
    is atomic when the source and destination are on the same filesystem, so
    readers only ever observe the file as absent or complete, never partial.
  • MatchFile (cann_utils.cc): explicitly skip _tmp.om files as a
    defensive guard against any future naming overlap.
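The write-then-rename pattern described above can be sketched as follows. This is a minimal illustration, not the code in this PR: WriteFileAtomically and LooksLikeTempFile are hypothetical names, and the PID suffix is an assumption based on the "PID-isolated temp file" wording in the Changes table (the bullet above gives the name as file_name + "_tmp"). It assumes a POSIX system and that the temp file sits in the same directory as the final file, so the rename stays on one filesystem.

```cpp
#include <cassert>
#include <filesystem>
#include <fstream>
#include <string>
#include <system_error>
#include <unistd.h>  // getpid (POSIX only)

namespace fs = std::filesystem;

// Hypothetical helper: write `bytes` to a per-process temp file next to
// `final_path`, then rename it into place. Returns false on any failure.
bool WriteFileAtomically(const fs::path& final_path, const std::string& bytes) {
  assert(!final_path.empty());
  fs::path tmp_path = final_path;
  // Assumed naming scheme: <final>_tmp_<pid>, so concurrent writers never
  // share a temp file.
  tmp_path += "_tmp_" + std::to_string(::getpid());

  std::error_code ec;
  {
    std::ofstream out(tmp_path, std::ios::binary | std::ios::trunc);
    if (!out) return false;
    out.write(bytes.data(), static_cast<std::streamsize>(bytes.size()));
    out.flush();
    if (!out) {
      out.close();
      fs::remove(tmp_path, ec);
      return false;
    }
  }  // stream is closed here, before the file becomes visible

  // Publish: readers see either no file or the complete file.
  fs::rename(tmp_path, final_path, ec);
  if (ec) {
    fs::remove(tmp_path, ec);
    return false;
  }
  return true;
}

// Hypothetical guard in the spirit of the MatchFile change: treat any name
// containing "_tmp" as an in-progress file that must not be matched.
bool LooksLikeTempFile(const std::string& file_name) {
  return file_name.find("_tmp") != std::string::npos;
}
```

If several processes publish the same path, the last rename wins. That is harmless only if every process writes identical contents, which this sketch assumes.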

Motivation and Context

Issue #26778 reports a reproducible CANN error when executing
aclmdlLoadFromFile() under 8-card concurrent compilation with torchrun.

The fix is validated with a standalone C++ mock harness that simulates 8
concurrent processes racing on the same file path. No Ascend hardware is
required, because the race is purely filesystem-level. The buggy path
consistently produces corrupted reads; the atomic-rename path produces zero
corruptions across 10,000+ iterations.
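For readers who want to reproduce the idea, here is a much reduced sketch of such a harness. It is not the author's harness: it uses one writer and one forked reader rather than 8 processes, detects a partial file by size only, and all names (WriteModel, SawPartialFile, CountPartialReads) are hypothetical. It assumes a POSIX system.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstdint>
#include <filesystem>
#include <fstream>
#include <string>
#include <system_error>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

namespace fs = std::filesystem;

// Writer: streams straight into `path` (buggy path) or into a temp file that
// is renamed into place once complete (atomic path).
void WriteModel(const fs::path& path, const std::string& payload, bool atomic) {
  fs::path target = path;
  if (atomic) target += "_tmp_" + std::to_string(::getpid());
  {
    std::ofstream out(target, std::ios::binary | std::ios::trunc);
    const std::size_t kChunk = 4096;
    // Small flushed chunks widen the window in which the file is partial.
    for (std::size_t i = 0; i < payload.size(); i += kChunk) {
      const std::size_t n = std::min<std::size_t>(kChunk, payload.size() - i);
      out.write(payload.data() + i, static_cast<std::streamsize>(n));
      out.flush();
    }
  }
  std::error_code ec;
  if (atomic) fs::rename(target, path, ec);
}

// Reader: polls until the complete file is visible (or a 5 s deadline).
// Returns true if it ever saw the file at any other size.
bool SawPartialFile(const fs::path& path, std::uintmax_t expected) {
  bool partial = false;
  const auto deadline =
      std::chrono::steady_clock::now() + std::chrono::seconds(5);
  while (std::chrono::steady_clock::now() < deadline) {
    std::error_code ec;
    const std::uintmax_t size = fs::file_size(path, ec);
    if (ec) continue;             // file is not there yet
    if (size == expected) break;  // complete file observed
    partial = true;               // visible but incomplete
  }
  return partial;
}

// Runs `iterations` writer/reader races and counts the iterations in which
// the reader saw a partial file. Returns -1 if fork() fails.
int CountPartialReads(const fs::path& path, bool atomic, int iterations) {
  assert(iterations >= 0);
  const std::string payload(1 << 20, 'x');  // 1 MiB stand-in for an .om file
  int partial_reads = 0;
  std::error_code ec;
  for (int i = 0; i < iterations; ++i) {
    fs::remove(path, ec);
    const pid_t reader = fork();
    if (reader < 0) return -1;
    if (reader == 0) _exit(SawPartialFile(path, payload.size()) ? 1 : 0);
    WriteModel(path, payload, atomic);
    int status = 0;
    waitpid(reader, &status, 0);
    if (WIFEXITED(status) && WEXITSTATUS(status) == 1) ++partial_reads;
  }
  fs::remove(path, ec);
  return partial_reads;
}
```

With atomic set to true the count should always be zero, since the final path only ever appears fully written. With atomic set to false the count depends on scheduling and may differ from run to run.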

Changes

| File | Change |
| --- | --- |
| onnxruntime/core/providers/cann/cann_graph.cc | PID-isolated temp file + atomic rename in BuildONNXModel |
| onnxruntime/core/providers/cann/cann_utils.cc | Skip _tmp_ files in MatchFile |

@0AnshuAditya0
Author

@microsoft-github-policy-service agree

Development

Successfully merging this pull request may close these issues.

[Build] NPU Multi-Card Concurrent Compilation Causes OM File Race Condition (Read-While-Write Error)
