Skip to content

Add log_train_loss_on_step toggle to EasySyntax#886

Open
sevmag wants to merge 5 commits into
graphnet-team:mainfrom
sevmag:feature/log-train-loss-on-step
Open

Add log_train_loss_on_step toggle to EasySyntax#886
sevmag wants to merge 5 commits into
graphnet-team:mainfrom
sevmag:feature/log-train-loss-on-step

Conversation

@sevmag
Copy link
Copy Markdown
Collaborator

@sevmag sevmag commented May 4, 2026

Summary

  • Adds an opt-in log_train_loss_on_step flag to EasySyntax that logs a per-step train_loss_step metric in addition to the existing epoch-aggregated train_loss.
  • Default is False, so existing logging behavior is unchanged.

addresses #882

Adds an opt-in `log_train_loss_on_step` constructor argument that, when
enabled, logs the per-batch training loss under `train_loss_step` in
addition to the epoch-aggregated `train_loss`. Default is False so
existing behavior is unchanged.
@christianlocatelli
Copy link
Copy Markdown
Collaborator

I will look at this.

@christianlocatelli christianlocatelli self-requested a review May 12, 2026 18:33
Comment thread src/graphnet/models/easy_model.py Outdated
scheduler_class: Optional[type] = None,
scheduler_kwargs: Optional[Dict] = None,
scheduler_config: Optional[Dict] = None,
log_train_loss_on_step: bool = False,
Copy link
Copy Markdown
Collaborator

@christianlocatelli christianlocatelli May 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The variable name could be renamed to also_log_train_loss_per_step. This would immediately clarify, that it is an additional option for logging the per-batch loss under a different key.

Suggested change
log_train_loss_on_step: bool = False,
also_log_train_loss_per_step: bool = False,

It could be also useful to add a Docstring explaining the arguments in __init__(), but especially for also_log_train_loss_per_step.

    """
    Args:
        also_log_train_loss_per_step:
            If `True`, logs an additional per-batch metric (`train_loss_step`)
            alongside the existing per-epoch metric (`train_loss`). This can
            be useful for debugging training instabilities or monitoring
            convergence within long epochs.
    """

Comment thread src/graphnet/models/easy_model.py Outdated
self._scheduler_class = scheduler_class
self._scheduler_kwargs = scheduler_kwargs or dict()
self._scheduler_config = scheduler_config or dict()
self._log_train_loss_on_step = log_train_loss_on_step
Copy link
Copy Markdown
Collaborator

@christianlocatelli christianlocatelli May 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
self._log_train_loss_on_step = log_train_loss_on_step
self._also_log_train_loss_per_step = also_log_train_loss_per_step

Comment thread src/graphnet/models/easy_model.py Outdated
Comment on lines +258 to +266
if self._log_train_loss_on_step:
self.log(
"train_loss_step",
loss,
batch_size=batch_size,
prog_bar=False,
on_epoch=False,
on_step=True,
sync_dist=True,
Copy link
Copy Markdown
Collaborator

@christianlocatelli christianlocatelli May 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be computationally expensive, if sync_dist=True.
There would be syncing across GPUs on every batch, which quickly adds up for high batch number. It should be maybe clarified in the Docstring at the top, that the training might be slowed down. The default of this option could also be set to sync_dist=False.

Suggested change
if self._log_train_loss_on_step:
self.log(
"train_loss_step",
loss,
batch_size=batch_size,
prog_bar=False,
on_epoch=False,
on_step=True,
sync_dist=True,
if self._also_log_train_loss_on_step:
self.log(
"train_loss_step",
loss,
batch_size=batch_size,
prog_bar=False,
on_epoch=False,
on_step=True,
sync_dist=True,

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point! Let's set sync_dist to false as the default

Co-authored-by: Christian Locatelli <97306084+christianlocatelli@users.noreply.github.com>
Copy link
Copy Markdown
Collaborator

@christianlocatelli christianlocatelli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left some comments about optional naming and doc improvements.

The previous commit ("Apply suggestions from code review") was created via
GitHub's batch-suggestion apply, which mangled the indentation and left a
name mismatch, so the module no longer imported:
- under-indented `also_log_train_loss_per_step` parameter and attribute
- top-level `if self._also_log_train_loss_on_step:` referencing an attribute
  that is never set (`_on_step` vs `_per_step`)

Re-apply the reviewer's intent cleanly: rename to `also_log_train_loss_per_step`,
log the per-step metric with `sync_dist=False`, and document all `__init__`
arguments.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@sevmag
Copy link
Copy Markdown
Collaborator Author

sevmag commented May 27, 2026

Hey @christianlocatelli, I implemented your suggestions. Sorry for the commit name, that was claude 😅

@sevmag sevmag requested a review from christianlocatelli May 27, 2026 01:59
Copy link
Copy Markdown
Collaborator

@christianlocatelli christianlocatelli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me, thanks for editing the code 👍

@sevmag
Copy link
Copy Markdown
Collaborator Author

sevmag commented May 27, 2026

@Aske-Rosted tagging you here for completeness (and a potential approval 😅 )

Comment thread src/graphnet/models/easy_model.py Outdated
scheduler_class: Optional[type] = None,
scheduler_kwargs: Optional[Dict] = None,
scheduler_config: Optional[Dict] = None,
also_log_train_loss_per_step: bool = False,
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am sorry for being a little bit annoying here but I think it is better to just expose the on_epoch and on_step from the lightning library then the user can decide themselves whether they want just on step, or just on epoch, on both or on neither. The standard values should follow current behavior.

Suggested change
also_log_train_loss_per_step: bool = False,
log_on_epoch: bool = True
log_on_step: bool = False,

Comment thread src/graphnet/models/easy_model.py Outdated
Comment on lines 268 to 269
on_epoch=True,
on_step=False,
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And then expose the logging setting to the class. This removes duplicate code and then we don't have to define the batch_size.

Suggested change
on_epoch=True,
on_step=False,
on_epoch=log_on_epoch,
on_step=log_on_step,

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

makes sense, but how do we want to handle the logging of the val loss? log_on_epoch and log_on_step for me sounds like you log both val and train on epoch and or step, which I think could also be valid (personally I log train on log and step and val only on epoch). As long as we agree on something together I think either way is fine

Copy link
Copy Markdown
Collaborator

@Aske-Rosted Aske-Rosted May 29, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is fine to have the logging of the validation and train loss behave in the same way. In principle we could separate the arguments for validation and training, but I think that is a little too many arguments, and moving towards instances where people should just create their own torch-lightning callbacks.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants