fix: Secure tar archive extraction #15811

Open: chtruong814 wants to merge 2 commits into main from chtruong/extract


chtruong814 (Collaborator) commented Jun 18, 2026

What does this PR do?

Adds a shared safe tar extraction helper and replaces raw tar extraction in model, notebook, test-data, and dataset-processing paths to prevent path traversal when unpacking archives. It also validates save_artifacts() artifact paths before extract/copy/rename and sanitizes ASR tokenizer rename targets so raw archive metadata cannot move files outside the intended output directory.
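
For illustration only, the artifact-path validation and rename-target sanitization described above could look roughly like the sketch below. The helper names `resolve_inside` and `rename_target` are hypothetical and are not taken from this PR's diff; the sketch assumes the check is "resolve the candidate path and confirm it stays under the output directory".

```python
import os
from pathlib import Path, PurePosixPath


def resolve_inside(base_dir: str, untrusted: str) -> Path:
    """Return base_dir/untrusted, or raise ValueError if it would leave base_dir.

    Hypothetical helper: illustrates the containment check only.
    """
    if PurePosixPath(untrusted).is_absolute() or os.path.isabs(untrusted):
        raise ValueError(f"absolute path not allowed: {untrusted!r}")
    base = Path(base_dir).resolve()
    target = (base / untrusted).resolve()
    # The resolved target must be the base itself or live somewhere below it.
    if target != base and base not in target.parents:
        raise ValueError(f"path escapes {base}: {untrusted!r}")
    return target


def rename_target(out_dir: str, member_name: str) -> Path:
    """Build a rename destination from the member's basename only.

    Hypothetical helper: directory components from archive metadata are dropped,
    so the destination cannot point outside out_dir.
    """
    name = PurePosixPath(member_name).name
    if name in ("", ".", ".."):
        raise ValueError(f"member has no usable basename: {member_name!r}")
    return resolve_inside(out_dir, name)
```

With these helpers, a member named `../../evil/tokenizer.model` would be renamed to `tokenizer.model` inside the output directory, while an artifact path such as `/etc/passwd` or `../x` would be rejected before any extract, copy, or rename happens.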

Changelog

  • Add nemo.utils.tar_utils.safe_extract, which rejects absolute paths, .. traversal, symlinks, and hardlinks by default.
  • Route SaveRestoreConnector._safe_extract() through the shared helper while preserving its legacy skip-and-warn behavior with skip_unsafe=True.
  • Replace raw tar.extract() / tar.extractall() usage across affected model utilities, notebook dataset download, test-data setup, dataset-processing scripts, and the torchaudio conversion script.
  • Validate save_artifacts() artifact paths before constructing extract, copy, or rename paths.
  • Derive ASR tokenizer output names from extracted member basenames before os.rename().
  • Add regression tests for traversal rejection, absolute artifact-path move prevention, ASR tokenizer rename target sanitization, and a static guard preventing raw tar extraction outside the shared helper.
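
As a rough sketch of the behavior listed above (not the actual `nemo.utils.tar_utils.safe_extract` implementation, whose signature is not shown in this description), a helper that rejects absolute paths, `..` traversal, symlinks, and hardlinks might look like this. The name `safe_extract_sketch`, its return value, and the demo archive builder are assumptions made for the example; only the `skip_unsafe` flag name comes from the changelog.

```python
import io
import os
import tarfile
import tempfile
import warnings


def safe_extract_sketch(tar: tarfile.TarFile, dest: str, skip_unsafe: bool = False) -> list:
    """Extract only members that stay inside dest; return the extracted names."""
    dest_root = os.path.realpath(dest)
    safe_members = []
    # First pass: validate every member before anything touches the disk.
    for member in tar.getmembers():
        target = os.path.realpath(os.path.join(dest_root, member.name))
        reason = None
        if member.issym() or member.islnk():
            reason = "symlink or hardlink"
        elif os.path.isabs(member.name):
            reason = "absolute path"
        elif os.path.commonpath([dest_root, target]) != dest_root:
            reason = "path traversal"
        if reason is None:
            safe_members.append(member)
        elif skip_unsafe:
            # Legacy-style behavior: warn and keep going.
            warnings.warn(f"skipping unsafe member {member.name!r}: {reason}")
        else:
            raise ValueError(f"unsafe member {member.name!r}: {reason}")
    # Second pass: extract. Use the stdlib 'data' filter when this Python has it.
    kwargs = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}
    for member in safe_members:
        tar.extract(member, dest_root, **kwargs)
    return [m.name for m in safe_members]


def build_demo_archive() -> bytes:
    """Create an in-memory archive with one benign and one traversal member."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name in ("ok/model.txt", "../evil.txt"):
            data = b"demo"
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        with tarfile.open(fileobj=io.BytesIO(build_demo_archive())) as tar:
            print(safe_extract_sketch(tar, tmp, skip_unsafe=True))
```

Validating all members before extracting any means the strict mode (`skip_unsafe=False`) fails without leaving a partially unpacked directory behind, while `skip_unsafe=True` mirrors the skip-and-warn behavior described for `SaveRestoreConnector._safe_extract()`.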

Verification

  • MPLCONFIGDIR=.pytest_cache/matplotlib UV_CACHE_DIR=.uv-cache uv run pytest tests/utils/test_tar_utils.py -v
  • Targeted tests/core/test_save_restore.py save/restore compatibility cases
  • UV_CACHE_DIR=.uv-cache uv run isort --check <changed files>
  • UV_CACHE_DIR=.uv-cache uv run --with black==24.10.0 black --check <changed files>
  • rg -n "tar\\.extract(all)?\\(" nemo tests scripts
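
The `rg` check above can also be expressed as a small Python scan, which is one plausible shape for the static guard mentioned in the changelog. This is a sketch under assumptions: the function name `find_raw_extractions` and its `allowed` parameter are hypothetical, and the real regression test in this PR may be structured differently.

```python
import re
from pathlib import Path

# Same pattern as the rg command: tar.extract( or tar.extractall(
RAW_EXTRACT = re.compile(r"tar\.extract(all)?\(")


def find_raw_extractions(roots, allowed=()):
    """Return (path, line number) pairs for raw tar extraction calls.

    Files listed in `allowed` (for example the shared helper itself) are skipped.
    """
    allowed_paths = {Path(p).resolve() for p in allowed}
    hits = []
    for root in roots:
        for path in sorted(Path(root).rglob("*.py")):
            if path.resolve() in allowed_paths:
                continue
            text = path.read_text(encoding="utf-8", errors="ignore")
            for lineno, line in enumerate(text.splitlines(), start=1):
                if RAW_EXTRACT.search(line):
                    hits.append((str(path), lineno))
    return hits
```

A test could then assert that `find_raw_extractions(["nemo", "tests", "scripts"], allowed=[...])` returns an empty list, with the allow-list naming only the file that hosts the shared helper.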

Signed-off-by: Charlie Truong <chtruong@nvidia.com>
github-actions bot added the ASR label Jun 18, 2026
chtruong814 changed the title from "Secure tar archive extraction" to "fix: Secure tar archive extraction" Jun 18, 2026
chtruong814 added the r3.0.0 and skip-linting labels Jun 18, 2026
chtruong814 (Collaborator, Author) commented

/ok to test a6b1957

Signed-off-by: Charlie Truong <chtruong@nvidia.com>
chtruong814 (Collaborator, Author) commented

/ok to test 53c721a


Labels

  • ASR
  • core: Changes to NeMo Core
  • r3.0.0: Auto-cherrypick to release branch. Apply before merge; cherrypick happens after merge.
  • skip-linting
  • TTS
