feat: callback support (#805) #875
Conversation
(#805) Add PyTorch Lightning-style callbacks with hooks for train start/end, batch end, validation end, checkpoint save, and exceptions. Includes `@rank_zero_only` decorator for distributed training. Callbacks are passed programmatically to recipe constructors and receive full training context (recipe object, `MetricsSample`, checkpoint_info, etc.).

Signed-off-by: Yuhe Zhang <yuhe@polarr.co>
```python
# Special case for VLM finetune which uses finetune.py instead of train_ft.py
if args.domain == "vlm" and args.command == "finetune":
    command = "finetune"
```
btw, I also fixed a bug here; previously the aliases would make the VLM finetune command map to a non-existent file
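The alias bug can be illustrated with a minimal sketch. The function name and alias table below are assumptions for illustration, not the PR's actual code; the point is that without the special case, the generic alias would rewrite VLM finetune to a `train_ft`-style entry point that doesn't exist for VLM:

```python
def resolve_command(domain: str, command: str) -> str:
    """Map a CLI command to an entry-point name (illustrative sketch)."""
    aliases = {"finetune": "train_ft"}  # assumed alias table for illustration
    # Special case for VLM finetune which uses finetune.py instead of train_ft.py
    if domain == "vlm" and command == "finetune":
        return "finetune"
    return aliases.get(command, command)

print(resolve_command("vlm", "finetune"))  # finetune
print(resolve_command("llm", "finetune"))  # train_ft
```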
Hello! Quick check-in on this PR. I noticed I missed adding callback integration to the train_biencoder recipe. If this aligns with your plans, I'm happy to iterate or expand it. If not, no worries; I know this may not be a high priority. I just want to understand the direction so I can adjust accordingly. Thanks!
Thanks a lot @yuhezhang-ai , I will review this again today. I want to make sure that any callback only consumes information and that the callback code does not provide any way to modify the trainer's state.
Signed-off-by: Yuhe Zhang <yuhe@polarr.co>
Thanks for raising this. I agree callbacks should be observers only (consume info, not modify trainer state). Regarding what we pass into callbacks, I see two reasonable options:

- Option A (current): keep passing the full recipe object and related training context.
- Option B: pass a small immutable context object (e.g., `is_main_rank`, `world_size`, `checkpoint_dir`, step/epoch, etc.). This reduces the surface area and makes the observer-only intent clearer.

Happy to go with whichever direction you prefer. btw, I've merged main and added the callback hooks for the missing one (train_biencoder).
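Option B could be sketched as a frozen dataclass. The class name `CallbackContext` is an assumption; the field names come from the discussion above. Freezing the dataclass means callbacks cannot mutate trainer state through the context, which enforces the observer-only intent:

```python
from dataclasses import dataclass

# Hypothetical sketch of Option B: a small, immutable context object.
@dataclass(frozen=True)
class CallbackContext:
    is_main_rank: bool
    world_size: int
    checkpoint_dir: str
    step: int
    epoch: int

ctx = CallbackContext(True, 8, "/tmp/ckpts", 100, 1)
# frozen=True makes attribute assignment raise FrozenInstanceError,
# so a callback cannot write back into the training state via the context.
try:
    ctx.step = 101
except Exception as e:
    print(type(e).__name__)  # FrozenInstanceError
```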
jgerh
left a comment
Completed tech pubs review of the .md file. Only two minor edits for our style.
Co-authored-by: jgerh <163925524+jgerh@users.noreply.github.com>
@akoumpa Is this ready for merge, or is more review required?
jgerh
left a comment
Completed tech pubs review. Only 2 minor copyedits.
Suggested change under the docs' "## Introduction" section:

```diff
-Callbacks provide a flexible way to inject custom logic into the training loop without modifying recipe code. They enable integration with external systems, custom logging, metrics collection, and monitoring.
+Callbacks provide a flexible way to inject custom logic into the training loop without modifying the recipe code. They enable integration with external systems, custom logging, metrics collection, and monitoring.
```
- Metrics collection for external reporting
- Custom logging with prefixes
Suggested change to the example heading:

```diff
-### Running the Example
+### Run the Example
```
Implements a PyTorch Lightning-style callback system for Automodel recipes to enable custom integrations (e.g., Customizer metrics reporting, custom monitoring).
Key Features
- Lifecycle hooks: `on_train_start`, `on_train_batch_end`, `on_validation_end`, `on_save_checkpoint`, `on_exception`, `on_train_end`
- `@rank_zero_only` decorator for distributed training convenience
- Hooks receive full training context: `MetricsSample` objects, checkpoint info, etc.

Changes
- `Callback` base class and `CallbackRunner` in `nemo_automodel/components/callbacks/`
- Example script: `examples/llm_finetune/finetune_with_callback.py`
- Documentation: `docs/guides/callbacks.md`

Usage
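A minimal usage sketch, assuming the hook names and `@rank_zero_only` from this PR. The base class and decorator below are self-contained stand-ins so the example runs on its own, not the actual `nemo_automodel` implementations; in the real recipe, the callback instance would be passed programmatically to the recipe constructor:

```python
import os

def rank_zero_only(fn):
    """Run the decorated hook only on the main rank (stand-in for the PR's decorator)."""
    def wrapper(*args, **kwargs):
        if int(os.environ.get("RANK", "0")) == 0:
            return fn(*args, **kwargs)
    return wrapper

class Callback:
    """Observer-only base: every hook is a no-op by default."""
    def on_train_start(self, recipe): pass
    def on_train_batch_end(self, recipe, metrics): pass
    def on_train_end(self, recipe): pass

class PrefixLoggingCallback(Callback):
    """Logs lifecycle events with a custom prefix and counts batches."""
    def __init__(self, prefix):
        self.prefix = prefix
        self.seen_batches = 0

    @rank_zero_only
    def on_train_start(self, recipe):
        print(f"{self.prefix} training started")

    def on_train_batch_end(self, recipe, metrics):
        self.seen_batches += 1  # observe only; never mutate the recipe

cb = PrefixLoggingCallback("[demo]")
cb.on_train_start(recipe=None)            # on rank 0, prints "[demo] training started"
cb.on_train_batch_end(None, {"loss": 1.0})
print(cb.seen_batches)  # 1
```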
Validation
Unit tests: All 9 tests pass ✅
```
python -m unittest tests.unit_tests.recipes.test_callbacks -v
Ran 9 tests in 0.006s
OK
```

Example run:
Verified callbacks execute at correct lifecycle points with custom log prefixes:
Closes #805
Note: Happy to implement this callback feature! I designed it to be familiar (PyTorch Lightning-style) and practical for real integrations. Feel free to provide feedback on the design/API if you'd like any adjustments. (AI assisted with documentation generation, but I personally reviewed and refined everything to ensure quality and accuracy.)