feat: callback support (#805) #875
Conversation
(#805) Add PyTorch Lightning-style callbacks with hooks for train start/end, batch end, validation end, checkpoint save, and exceptions. Includes `@rank_zero_only` decorator for distributed training. Callbacks are passed programmatically to recipe constructors and receive full training context (recipe object, `MetricsSample`, checkpoint_info, etc.).

Signed-off-by: Yuhe Zhang <yuhe@polarr.co>
```python
# Special case for VLM finetune which uses finetune.py instead of train_ft.py
if args.domain == "vlm" and args.command == "finetune":
    command = "finetune"
```
btw, I also fixed a bug here; previously the aliases would make the VLM finetune command map to a non-existent file
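The alias bug can be illustrated with a minimal sketch. The function name and alias table below are assumptions for illustration, not the PR's actual code; the point is that without the special case, the generic alias would rewrite VLM finetune to a `train_ft`-style entry point that doesn't exist for VLM:

```python
def resolve_command(domain: str, command: str) -> str:
    """Map a CLI command to an entry-point name (illustrative sketch)."""
    aliases = {"finetune": "train_ft"}  # assumed alias table for illustration
    # Special case for VLM finetune which uses finetune.py instead of train_ft.py
    if domain == "vlm" and command == "finetune":
        return "finetune"
    return aliases.get(command, command)

print(resolve_command("vlm", "finetune"))  # finetune
print(resolve_command("llm", "finetune"))  # train_ft
```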
Hello! Quick check-in on this PR. I noticed I missed adding callback integration to the train_biencoder recipe. If this aligns with your plans, I'm happy to iterate or expand it. If not, no worries; I know this may not be a high priority. I just want to understand the direction so I can adjust accordingly. Thanks!
Thanks a lot @yuhezhang-ai , I will review this again today. I want to make sure that any callback only consumes information and that the callback code does not provide any way to modify the trainer's state.
Signed-off-by: Yuhe Zhang <yuhe@polarr.co>
Thanks for raising this. I agree callbacks should be observers only (consume info, not modify trainer state). Regarding what we pass into callbacks, I see two reasonable options:

- Option A (current): keep passing the full recipe object and related training context.
- Option B: pass a small immutable context object (e.g., `is_main_rank`, `world_size`, `checkpoint_dir`, step/epoch, etc.). This reduces the surface area and makes the observer-only intent clearer.

Happy to go with whichever direction you prefer. btw, I've merged main and added the callback hooks for the missing one (train_biencoder).
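Option B could be sketched as a frozen dataclass. The class name `CallbackContext` is an assumption; the field names come from the discussion above. Freezing the dataclass means callbacks cannot mutate trainer state through the context, which enforces the observer-only intent:

```python
from dataclasses import dataclass

# Hypothetical sketch of Option B: a small, immutable context object.
@dataclass(frozen=True)
class CallbackContext:
    is_main_rank: bool
    world_size: int
    checkpoint_dir: str
    step: int
    epoch: int

ctx = CallbackContext(True, 8, "/tmp/ckpts", 100, 1)
# frozen=True makes attribute assignment raise FrozenInstanceError,
# so a callback cannot write back into the training state via the context.
try:
    ctx.step = 101
except Exception as e:
    print(type(e).__name__)  # FrozenInstanceError
```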
jgerh
left a comment
Completed tech pubs review of the .md file. Only two minor edits for our style.
Co-authored-by: jgerh <163925524+jgerh@users.noreply.github.com>
@akoumpa Is this ready for merge, or is more review required?
jgerh
left a comment
Completed tech pubs review. Only 2 minor copyedits.
Suggested change under the docs' "## Introduction" section:

```diff
-Callbacks provide a flexible way to inject custom logic into the training loop without modifying recipe code. They enable integration with external systems, custom logging, metrics collection, and monitoring.
+Callbacks provide a flexible way to inject custom logic into the training loop without modifying the recipe code. They enable integration with external systems, custom logging, metrics collection, and monitoring.
```
- Metrics collection for external reporting
- Custom logging with prefixes
Suggested change to the example heading:

```diff
-### Running the Example
+### Run the Example
```
Implements a PyTorch Lightning-style callback system for Automodel recipes to enable custom integrations (e.g., Customizer metrics reporting, custom monitoring).
Key Features
- Lifecycle hooks: `on_train_start`, `on_train_batch_end`, `on_validation_end`, `on_save_checkpoint`, `on_exception`, `on_train_end`
- `@rank_zero_only` decorator for distributed training convenience
- Hooks receive full training context: `MetricsSample` objects, checkpoint info, etc.

Changes
- `Callback` base class and `CallbackRunner` in `nemo_automodel/components/callbacks/`
- Example script: `examples/llm_finetune/finetune_with_callback.py`
- Documentation: `docs/guides/callbacks.md`

Usage
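A minimal usage sketch, assuming the hook names and `@rank_zero_only` from this PR. The base class and decorator below are self-contained stand-ins so the example runs on its own, not the actual `nemo_automodel` implementations; in the real recipe, the callback instance would be passed programmatically to the recipe constructor:

```python
import os

def rank_zero_only(fn):
    """Run the decorated hook only on the main rank (stand-in for the PR's decorator)."""
    def wrapper(*args, **kwargs):
        if int(os.environ.get("RANK", "0")) == 0:
            return fn(*args, **kwargs)
    return wrapper

class Callback:
    """Observer-only base: every hook is a no-op by default."""
    def on_train_start(self, recipe): pass
    def on_train_batch_end(self, recipe, metrics): pass
    def on_train_end(self, recipe): pass

class PrefixLoggingCallback(Callback):
    """Logs lifecycle events with a custom prefix and counts batches."""
    def __init__(self, prefix):
        self.prefix = prefix
        self.seen_batches = 0

    @rank_zero_only
    def on_train_start(self, recipe):
        print(f"{self.prefix} training started")

    def on_train_batch_end(self, recipe, metrics):
        self.seen_batches += 1  # observe only; never mutate the recipe

cb = PrefixLoggingCallback("[demo]")
cb.on_train_start(recipe=None)            # on rank 0, prints "[demo] training started"
cb.on_train_batch_end(None, {"loss": 1.0})
print(cb.seen_batches)  # 1
```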
Validation
Unit tests: All 9 tests pass ✅
```
python -m unittest tests.unit_tests.recipes.test_callbacks -v
Ran 9 tests in 0.006s
OK
```

Example run:
Verified callbacks execute at correct lifecycle points with custom log prefixes:
Closes #805
Note: Happy to implement this callback feature! I designed it to be familiar (PyTorch Lightning-style) and practical for real integrations. Feel free to provide feedback on the design/API if you'd like any adjustments. (AI assisted with documentation generation, but I personally reviewed and refined everything to ensure quality and accuracy.)