Pull requests: awslabs/awsome-distributed-training

Pull requests list

Add Terraform support for task governance compute quotas
#1081 opened May 1, 2026 by FreCap
nemo: drop unnecessary OMPI_MCA_* Dockerfile ENVs per AWS NCCL/EFA review
#1078 opened May 1, 2026 by KeitaW Collaborator
4 of 5 tasks
nccl-tests: drop unnecessary NCCL/MPI env vars per AWS NCCL/EFA review
#1077 opened May 1, 2026 by KeitaW Collaborator
4 of 5 tasks
torchtitan: replace conda env with venv and pin all versions
#1074 opened Apr 29, 2026 by KeitaW Collaborator Draft
4 tasks done
FSDP: pin nccl-tests base, bump torch to cu130, log TFLOPS/MFU
#1073 opened Apr 29, 2026 by KeitaW Collaborator Draft
6 of 8 tasks
nemo: bump to nemo:26.02 and sync Slurm + Kubernetes Dockerfiles
#1072 opened Apr 29, 2026 by KeitaW Collaborator Draft
5 of 6 tasks
megatron-lm: bump to NGC pytorch:26.02 and add Llama 3 8B sbatch
#1071 opened Apr 29, 2026 by KeitaW Collaborator Draft
5 tasks done
feat: add DETR-ResNet50 object detection fine-tuning test case
#1068 opened Apr 28, 2026 by aravneelaws Contributor
7 tasks done
Adding a megatron-bridge sample
#1065 opened Apr 27, 2026 by allela-roy Contributor
Bump transformers from 4.48.0 to 5.0.0rc3 in /3.test_cases/pytorch/nvrx
Labels: dependencies, python
#1057 opened Apr 8, 2026 by dependabot Bot
Add veRL GRPO training recipe for gpt-oss-20b on g5.12xlarge
#1054 opened Apr 4, 2026 by nkumaraws Contributor
Bump requests from 2.32.3 to 2.33.0 in /3.test_cases/pytorch/nvrx
Labels: dependencies, python
#1036 opened Mar 25, 2026 by dependabot Bot
Add V-JEPA 2 (Meta FAIR) distributed training test case
#1035 opened Mar 23, 2026 by paragao Contributor
Add DeepSpeed CI regression tests for QLoRA and GPT-103B
#1029 opened Mar 20, 2026 by paragao Contributor
fix: overhaul CI workflows for FSDP regression tests
#1024 opened Mar 17, 2026 by paragao Contributor
Updating hyperpod-elastic-agent (HPEA) to v1.1.2 to support torch v2.6+
#1022 opened Mar 13, 2026 by aravneelaws Contributor
7 tasks done
docs: comprehensive instance hardware profiles (16 families)
#1021 opened Mar 13, 2026 by KeitaW Collaborator Draft
4 tasks
Add OSMO AMR Navigation test case
#1018 opened Mar 12, 2026 by KeitaW Collaborator
1 of 3 tasks
Add NeMo RL GRPO training with fault tolerance (NVRx) on EKS
#1010 opened Mar 9, 2026 by dmvevents Contributor
6 tasks
Add LeRobot pi0-FAST DROID multi-node training test case
#1003 opened Feb 26, 2026 by KeitaW Collaborator Draft
7 tasks
Updating CF stack for GB200 local zone deployments
#968 opened Feb 17, 2026 by KeitaW Collaborator