Fix #380: issue 380 dlio mpi rank #396
Open
idevasena wants to merge 2 commits into
…nt mpirun launch
test_dlio_mpi.py selected its endpoint via int(os.environ['OMPI_COMM_WORLD_RANK']), which raises KeyError whenever the script is not launched by OpenMPI mpirun (plain python, MPICH, srun). Switch to the portable comm.Get_rank() (matching test_mpi_basic.py) and read the OMPI var only for diagnostics. Update tests/README.md to launch with mpirun and document --oversubscribe for under-provisioned hosts. Add regression test.
MLCommons CLA bot: All contributors have signed the MLCommons CLA ✍️ ✅
Fix #380: portable MPI rank in test_dlio_mpi.py + correct DLIO launch commands
Fixes #380
Summary
The integration test tests/integration/test_dlio_mpi.py, run as documented, crashed before doing anything useful, and the follow-on dlio_benchmark command it printed was itself invalid. This PR fixes both the test and the documented commands, and adds a regression test.
Root causes
1. KeyError on non-OpenMPI launchers. The test selected its endpoint with int(os.environ['OMPI_COMM_WORLD_RANK']). That env var is OpenMPI-specific and is unset under plain python, MPICH/mpiexec, or Slurm srun, so the test died with a bare KeyError: 'OMPI_COMM_WORLD_RANK'. The sibling test_mpi_basic.py already does this correctly via comm.Get_rank() and a safe os.environ.get(...).
2. Docs launched an MPI program without MPI. tests/README.md told users to run python tests/integration/test_dlio_mpi.py, which yields a single rank (not a meaningful multi-endpoint test) and triggers the KeyError above. It also gave no guidance for the "not enough slots available" error that OpenMPI raises when -np exceeds the host core count.
3. Invalid dlio_benchmark invocation. Both the test's printed "Next steps" and the trailing comment in configs/dlio/workload/multi_endpoint_mpi.yaml instructed users to run dlio_benchmark --config multi_endpoint_mpi.yaml. dlio_benchmark is a Hydra app and rejects --config as an ambiguous abbreviation of --config-path/--config-name/--config-dir. The correct form selects the workload via a Hydra override.
Changes
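The rank-selection change can be sketched roughly as follows. This is a minimal illustration, not the actual diff: the ENDPOINTS values and the helper names select_endpoint, get_rank_and_size, and main are hypothetical, and the fallback to a single rank when mpi4py cannot be imported is an assumption added so the snippet runs without an MPI installation.

```python
import os

# Hypothetical endpoint list; the real test defines its own four endpoints.
ENDPOINTS = [f"endpoint-{i}" for i in range(4)]


def select_endpoint(rank, endpoints):
    """Round-robin selection: rank N maps to endpoints[N % len(endpoints)]."""
    return endpoints[rank % len(endpoints)]


def get_rank_and_size():
    """Return (rank, size) from MPI when mpi4py is importable, else (0, 1)."""
    try:
        from mpi4py import MPI  # optional in this sketch
    except ImportError:
        return 0, 1
    comm = MPI.COMM_WORLD
    return comm.Get_rank(), comm.Get_size()


def main():
    rank, size = get_rank_and_size()
    # Display only: .get() never raises, unlike os.environ[...] indexing.
    ompi_rank = os.environ.get("OMPI_COMM_WORLD_RANK", "not set")
    if size == 1:
        print("Single rank detected; launch with: "
              "mpirun -np 8 python tests/integration/test_dlio_mpi.py")
    print(f"rank={rank} size={size} OMPI_COMM_WORLD_RANK={ompi_rank} "
          f"endpoint={select_endpoint(rank, ENDPOINTS)}")
```

Because the rank comes from the communicator, the same selection logic applies under OpenMPI, MPICH, or srun; calling main() without a launcher only prints the hint and the rank 0 mapping.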
- tests/integration/test_dlio_mpi.py: select the endpoint from rank = comm.Get_rank() instead of the OpenMPI env var (read now only for display, via os.environ.get(..., 'not set')); add a size == 1 guard that prints the correct mpirun invocation; correct the printed "Next steps" command to dlio_benchmark workload=multi_endpoint_mpi --config-dir=configs/dlio.
- tests/README.md: launch the test with mpirun -np 8 ... and document --oversubscribe for hosts with fewer cores than -np.
- configs/dlio/workload/multi_endpoint_mpi.yaml: fix the how-to-run comment to the valid Hydra workload=... --config-dir=... form, with a note on why --config <file> fails.
- tests/integration/test_issue_380_dlio_mpi_rank.py (new): regression tests that reproduce the original KeyError, pin rank-based round-robin endpoint selection, and guard (via source checks) that neither the hard env-var subscript nor the ambiguous dlio_benchmark --config pattern can return. No live MPI runtime or mpi4py required.
The corrected command form matches the repo's own harness in mlpstorage_py/benchmarks/dlio.py and the sibling configs/dlio/workload/llama3_8b_checkpoint.yaml.
Validation
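For anyone reproducing the launches, the two corrected command forms can be assembled as in the sketch below. The helper names mpirun_command and dlio_command are hypothetical, and the rule "add --oversubscribe when -np exceeds the core count" assumes OpenMPI's default of one slot per core with no hostfile.

```python
import os
import shlex


def mpirun_command(num_ranks, script, available_cores=None):
    """Build an OpenMPI mpirun argv, adding --oversubscribe when ranks exceed cores."""
    if available_cores is None:
        available_cores = os.cpu_count() or 1
    argv = ["mpirun", "-np", str(num_ranks)]
    if num_ranks > available_cores:
        # Avoids OpenMPI's "not enough slots available" error on small hosts.
        argv.append("--oversubscribe")
    return argv + ["python", script]


def dlio_command(workload, config_dir):
    """Select the workload through a Hydra override instead of --config <file>."""
    return ["dlio_benchmark", f"workload={workload}", f"--config-dir={config_dir}"]


if __name__ == "__main__":
    print(shlex.join(mpirun_command(8, "tests/integration/test_dlio_mpi.py")))
    print(shlex.join(dlio_command("multi_endpoint_mpi", "configs/dlio")))
```

Returning argv lists rather than strings keeps the commands safe to pass to subprocess.run without a shell; shlex.join is used only for display.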
- pytest tests/integration/test_issue_380_dlio_mpi_rank.py -v: 15 passed.
- python tests/integration/test_dlio_mpi.py (no launcher): exits 0 with a launcher hint instead of crashing.
- mpirun -np 8 python tests/integration/test_dlio_mpi.py: ranks 0–7 map round-robin to endpoints 0–3; rank N selects endpoint[N % 4].
Tests
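The three kinds of checks described for the regression test (reproducing the KeyError, pinning round-robin selection, and source checks) might look like the sketch below. It is an illustration only, not the contents of tests/integration/test_issue_380_dlio_mpi_rank.py: the function names, regular expressions, and sample source strings are all assumptions. Like the real test, it needs no MPI runtime or mpi4py.

```python
import re

# Assumed regexes for the two patterns the source checks guard against.
HARD_ENV_SUBSCRIPT = re.compile(
    r"os\.environ\[\s*['\"]OMPI_COMM_WORLD_RANK['\"]\s*\]")
AMBIGUOUS_CONFIG_FLAG = re.compile(r"dlio_benchmark\s+--config(\s|=)")


def legacy_rank(environ):
    """The pre-fix lookup: raises KeyError when the OpenMPI variable is absent."""
    return int(environ["OMPI_COMM_WORLD_RANK"])


def endpoint_index(rank, num_endpoints=4):
    """Rank-based round-robin: rank N selects index N % num_endpoints."""
    return rank % num_endpoints


def find_regressions(source):
    """Return the names of forbidden patterns found in a source string."""
    found = []
    if HARD_ENV_SUBSCRIPT.search(source):
        found.append("hard env-var subscript")
    if AMBIGUOUS_CONFIG_FLAG.search(source):
        found.append("ambiguous --config flag")
    return found


def test_legacy_lookup_raises_without_openmpi():
    try:
        legacy_rank({})  # empty mapping stands in for a non-OpenMPI launch
    except KeyError as exc:
        assert exc.args[0] == "OMPI_COMM_WORLD_RANK"
    else:
        raise AssertionError("expected KeyError")


def test_round_robin_mapping():
    assert [endpoint_index(r) for r in range(8)] == [0, 1, 2, 3, 0, 1, 2, 3]


def test_source_checks():
    bad = ("rank = int(os.environ['OMPI_COMM_WORLD_RANK'])\n"
           "# dlio_benchmark --config multi_endpoint_mpi.yaml")
    good = ("rank = comm.Get_rank()\n"
            "# dlio_benchmark workload=multi_endpoint_mpi --config-dir=configs/dlio")
    assert find_regressions(bad) == ["hard env-var subscript",
                                     "ambiguous --config flag"]
    assert find_regressions(good) == []
```

The config-flag pattern requires whitespace or "=" right after --config, so the valid --config-dir=... form is not flagged.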