Hi, thanks for the great work!
I have a question regarding the logic in calculate_dimensions.
Currently, the image height and width are constrained to be multiples of 32.
From my understanding:
The VAE has a downsampling factor of 8, so the input dimensions need to be multiples of 8 for the latent spatial size to be an integer.
Before entering the DiT, the latent is passed through a patch embedding layer with patch_size = 2.
That would further imply a total factor of 8 × 2 = 16.
Based on this, it seems that constraining the image dimensions to be multiples of 16 should already be sufficient.
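To make the arithmetic concrete, here is a minimal sketch of my reasoning. The helper names (required_multiple, floor_to_multiple, token_grid) are hypothetical and are not taken from the repository's calculate_dimensions. The factors 8 and 2 are the values I am assuming above.

```python
# Assumed values from my reading of the architecture, not confirmed by the maintainers.
VAE_SCALE_FACTOR = 8  # spatial downsampling of the VAE
PATCH_SIZE = 2        # patch size of the patch embedding before the DiT


def required_multiple(vae_scale=VAE_SCALE_FACTOR, patch_size=PATCH_SIZE):
    """Smallest step the image size must respect so that both the latent
    size and the token grid come out as whole numbers."""
    return vae_scale * patch_size


def floor_to_multiple(value, multiple):
    """Round value down to the nearest multiple (hypothetical helper)."""
    return (value // multiple) * multiple


def token_grid(height, width, vae_scale=VAE_SCALE_FACTOR, patch_size=PATCH_SIZE):
    """Return the (rows, cols) token grid seen by the DiT for an image size."""
    step = vae_scale * patch_size
    if height % step or width % step:
        raise ValueError(f"height and width must be multiples of {step}")
    latent_h, latent_w = height // vae_scale, width // vae_scale
    return latent_h // patch_size, latent_w // patch_size


# 1040 is a multiple of 16 but not of 32, and still yields a whole token grid.
print(required_multiple())          # 16
print(token_grid(1040, 1024))       # (65, 64)
print(floor_to_multiple(1050, 16))  # 1040
print(floor_to_multiple(1050, 32))  # 1024
```

Under these assumptions, a size such as 1040 × 1024 splits cleanly into latents and patches, but the multiple-of-32 constraint would reduce it to 1024 × 1024.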
Could you clarify why a multiple of 32 is required here?
Is there an additional downsampling stage, architectural constraint, or implementation detail that I might be missing?
Thanks in advance for the clarification!