Add log_train_loss_on_step toggle to EasySyntax#886
Conversation
Adds an opt-in `log_train_loss_on_step` constructor argument that, when enabled, logs the per-batch training loss under `train_loss_step` in addition to the epoch-aggregated `train_loss`. Default is False so existing behavior is unchanged.
|
I will look at this. |
| scheduler_class: Optional[type] = None, | ||
| scheduler_kwargs: Optional[Dict] = None, | ||
| scheduler_config: Optional[Dict] = None, | ||
| log_train_loss_on_step: bool = False, |
There was a problem hiding this comment.
The variable name could be renamed to also_log_train_loss_per_step. This would immediately clarify, that it is an additional option for logging the per-batch loss under a different key.
| log_train_loss_on_step: bool = False, | |
| also_log_train_loss_per_step: bool = False, |
It could be also useful to add a Docstring explaining the arguments in __init__(), but especially for also_log_train_loss_per_step.
"""
Args:
also_log_train_loss_per_step:
If `True`, logs an additional per-batch metric (`train_loss_step`)
alongside the existing per-epoch metric (`train_loss`). This can
be useful for debugging training instabilities or monitoring
convergence within long epochs.
"""| self._scheduler_class = scheduler_class | ||
| self._scheduler_kwargs = scheduler_kwargs or dict() | ||
| self._scheduler_config = scheduler_config or dict() | ||
| self._log_train_loss_on_step = log_train_loss_on_step |
There was a problem hiding this comment.
| self._log_train_loss_on_step = log_train_loss_on_step | |
| self._also_log_train_loss_per_step = also_log_train_loss_per_step |
| if self._log_train_loss_on_step: | ||
| self.log( | ||
| "train_loss_step", | ||
| loss, | ||
| batch_size=batch_size, | ||
| prog_bar=False, | ||
| on_epoch=False, | ||
| on_step=True, | ||
| sync_dist=True, |
There was a problem hiding this comment.
It might be computationally expensive, if sync_dist=True.
There would be syncing across GPUs on every batch, which quickly adds up for high batch number. It should be maybe clarified in the Docstring at the top, that the training might be slowed down. The default of this option could also be set to sync_dist=False.
| if self._log_train_loss_on_step: | |
| self.log( | |
| "train_loss_step", | |
| loss, | |
| batch_size=batch_size, | |
| prog_bar=False, | |
| on_epoch=False, | |
| on_step=True, | |
| sync_dist=True, | |
| if self._also_log_train_loss_on_step: | |
| self.log( | |
| "train_loss_step", | |
| loss, | |
| batch_size=batch_size, | |
| prog_bar=False, | |
| on_epoch=False, | |
| on_step=True, | |
| sync_dist=True, |
There was a problem hiding this comment.
Good point! Let's set sync_dist to false as the default
Co-authored-by: Christian Locatelli <97306084+christianlocatelli@users.noreply.github.com>
christianlocatelli
left a comment
There was a problem hiding this comment.
I left some comments about optional naming and doc improvements.
The previous commit ("Apply suggestions from code review") was created via
GitHub's batch-suggestion apply, which mangled the indentation and left a
name mismatch, so the module no longer imported:
- under-indented `also_log_train_loss_per_step` parameter and attribute
- top-level `if self._also_log_train_loss_on_step:` referencing an attribute
that is never set (`_on_step` vs `_per_step`)
Re-apply the reviewer's intent cleanly: rename to `also_log_train_loss_per_step`,
log the per-step metric with `sync_dist=False`, and document all `__init__`
arguments.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
|
Hey @christianlocatelli, I implemented your suggestions. Sorry for the commit name, that was claude 😅 |
christianlocatelli
left a comment
There was a problem hiding this comment.
This looks good to me, thanks for editing the code 👍
|
@Aske-Rosted tagging you here for completeness (and a potential approval 😅 ) |
| scheduler_class: Optional[type] = None, | ||
| scheduler_kwargs: Optional[Dict] = None, | ||
| scheduler_config: Optional[Dict] = None, | ||
| also_log_train_loss_per_step: bool = False, |
There was a problem hiding this comment.
I am sorry for being a little bit annoying here but I think it is better to just expose the on_epoch and on_step from the lightning library then the user can decide themselves whether they want just on step, or just on epoch, on both or on neither. The standard values should follow current behavior.
| also_log_train_loss_per_step: bool = False, | |
| log_on_epoch: bool = True | |
| log_on_step: bool = False, |
| on_epoch=True, | ||
| on_step=False, |
There was a problem hiding this comment.
And then expose the logging setting to the class. This removes duplicate code and then we don't have to define the batch_size.
| on_epoch=True, | |
| on_step=False, | |
| on_epoch=log_on_epoch, | |
| on_step=log_on_step, |
There was a problem hiding this comment.
makes sense, but how do we want to handle the logging of the val loss? log_on_epoch and log_on_step for me sounds like you log both val and train on epoch and or step, which I think could also be valid (personally I log train on log and step and val only on epoch). As long as we agree on something together I think either way is fine
There was a problem hiding this comment.
I think it is fine to have the logging of the validation and train loss behave in the same way. In principle we could separate the arguments for validation and training, but I think that is a little too many arguments, and moving towards instances where people should just create their own torch-lightning callbacks.
Summary
log_train_loss_on_stepflag toEasySyntaxthat logs a per-steptrain_loss_stepmetric in addition to the existing epoch-aggregatedtrain_loss.False, so existing logging behavior is unchanged.addresses #882