Skip to content

Add Remote GPU Rental Training rule (rented/spot instance ops)#318

Open
Hanyuyuan6 wants to merge 2 commits into
PatrickJS:mainfrom
Hanyuyuan6:add-remote-gpu-rental-training
Open

Add Remote GPU Rental Training rule (rented/spot instance ops)#318
Hanyuyuan6 wants to merge 2 commits into
PatrickJS:mainfrom
Hanyuyuan6:add-remote-gpu-rental-training

Conversation

@Hanyuyuan6

@Hanyuyuan6 Hanyuyuan6 commented Jun 20, 2026

Copy link
Copy Markdown

Adds rules/remote-gpu-rental-training.mdc under Build Tools and Development.

The rule gives Cursor operating discipline for running long GPU jobs on rented/remote instances you don't own (AutoDL, RunPod, vast.ai, Lambda, Paperspace, Chinese platforms, bare SSH / Slurm / K8s) — the metered-tenant concerns the existing CUDA/Kubernetes/containerization rules don't cover: stop-vs-terminate billing safety, spot-preemption resilience, checkpoint-to-durable + idempotent resume, disk/inode budgeting, and a teardown gate that blocks terminate/destroy until results are pulled and verified.

It is a condensed, self-contained distillation of the open-source remote-gpu-trainer Agent Skill (MIT, https://github.com/Hanyuyuan6/remote-gpu-trainer), authored by me — not a copy of the source docs. Frontmatter follows the README ## Contributing format (description / globs / alwaysApply: false); README entry added alphabetically in the correct category.

Summary by CodeRabbit

Documentation

  • Added a new documentation page with comprehensive guidance for running long GPU training jobs on remote/rented machines, including cost-control practices, pre-flight checks, and trustworthy artifact verification.
  • Documented a full multi-phase job lifecycle, including resumable/idempotent checkpoint transfer and checkpoint-safe teardown.
  • Included platform-specific gotchas and troubleshooting tips for common failure modes (e.g., OOM, multi-GPU hangs, NaN/loss spikes).

@coderabbitai

coderabbitai Bot commented Jun 20, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: c4b04bd1-8b86-4e5f-8345-0c7c02f84009

📥 Commits

Reviewing files that changed from the base of the PR and between 4fd4aa6 and db0a3bd.

📒 Files selected for processing (1)
  • README.md
✅ Files skipped from review due to trivial changes (1)
  • README.md

📝 Walkthrough

Walkthrough

Adds a new rules/remote-gpu-rental-training.mdc cursor rule file covering remote GPU rental workflows: operating principles, a 6-phase lifecycle, an iron-law teardown gate, platform-specific gotchas, training failure diagnostics, and upstream attribution. Updates README.md with a link to the new rule.

Changes

Remote GPU Rental Training Rule

Layer / File(s) Summary
Remote GPU training rule content
rules/remote-gpu-rental-training.mdc
Adds the full rule with front-matter (alwaysApply: false), 10 operating principles (cost control, smoke testing, stop-vs-destroy semantics, inode monitoring, atomic checkpointing, approval gates), a 6-phase lifecycle (audit → SSH setup → wrapper → detached launch → durable monitoring → checkpoint sync/verify), an iron-law teardown gate blocking destroy/delete until checkpoints are verified locally, platform gotchas (AutoDL meter behavior, spot preemption, disk-full save failures, CRLF issues, China-specific mirrors), training failure guidance (CUDA OOM ladder, multi-GPU hangs, NaN spikes), and upstream attribution.
README index entry
README.md
Adds a bullet under "Build Tools and Development" linking to the new .mdc rule.

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~3 minutes

Poem

🐇 A bunny hops on rented GPUs,
Checkpoints safely before the bill accrues,
Six phases planned, the iron law held tight,
No destroy command till artifacts are right,
From CUDA OOM to AutoDL's quirks—
This rule's got you covered when training works! 🎉

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately summarizes the main change: adding a new rule file for remote GPU rental training operations, with clear context about the use case (rented/spot instances).
Description check ✅ Passed The description covers all required sections from the template: summary, contribution type (new rule file), value to users, files changed, quality checklist, and notes on attribution. All critical items are addressed.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
README.md (1)

261-261: 💤 Low value

Minor: Refine README description wording for clarity.

The README description says "spot-preemption resumable checkpointing," but this awkwardly combines two separate ideas. The rule file correctly lists them as "spot-preemption resilience, and resumable checkpointing." Consider updating the README to match:

"Run long GPU jobs on rented/remote instances (AutoDL, RunPod, vast.ai, Lambda, Slurm) with billing-safe teardown, spot-preemption resilience, resumable checkpointing, and disk/inode budgeting."

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@README.md` at line 261, The README description for the Remote GPU Rental
Training link awkwardly combines "spot-preemption resumable checkpointing" as a
single concept. Update this phrase to separate the two distinct ideas by
changing it to "spot-preemption resilience, resumable checkpointing" so it
matches the correct wording in the actual rule file and improves clarity.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@README.md`:
- Line 261: The README description for the Remote GPU Rental Training link
awkwardly combines "spot-preemption resumable checkpointing" as a single
concept. Update this phrase to separate the two distinct ideas by changing it to
"spot-preemption resilience, resumable checkpointing" so it matches the correct
wording in the actual rule file and improves clarity.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: eb0353a8-b709-4855-a6b6-732efb9ce01c

📥 Commits

Reviewing files that changed from the base of the PR and between b044f95 and 4fd4aa6.

📒 Files selected for processing (2)
  • README.md
  • rules/remote-gpu-rental-training.mdc

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant