Conversation
|
Thank you for the PR @annitang1997! I will review this in depth soon. cc @yiyixuxu too |
|
Is there any updates on the review process? 👀 Looking forward to use VidTok with diffusers. |
a-r-r-o-w
left a comment
There was a problem hiding this comment.
Thank you for the PR and congratulations for the release of your awesome work!
I did a first pass review about some changes that need to be made to make the implementation similar to remaining of the diffusers codebase. There are some core implementation details that will have to be refactored before we can merge. A good reference implementation for autoencoders can be found here:
- https://github.com/huggingface/diffusers/blob/0dec414d5bf2c7fe77684722b0a97324798bd7b3/src/diffusers/models/autoencoders/autoencoder_kl_ltx.py
- https://github.com/huggingface/diffusers/blob/0dec414d5bf2c7fe77684722b0a97324798bd7b3/src/diffusers/models/autoencoders/autoencoder_kl_mochi.py
I'd be happy to help assist in making some of these changes! 🤗
|
Hello, I have improved the code based on your feedback. Please check it. 🤗 |
|
Any updates in this thread? :) |
|
@deeptimhe Sorry for the delay, I'm on leave at the moment, and so is @yiyixuxu. I'll try to test the PR and give it a look next week when I'm back |
yiyixuxu
left a comment
There was a problem hiding this comment.
thanks for the PR!
I left some feedbacks, one note on diffusers coding style is we try not to use too many small methods/functions. ideally all the logics are implemented in forward
I made a few examples in the review, if you can apply similar changes through out the implementation it would be great:)
|
Hello, I have cleaned the code by removing small methods/functions based on your feedback. Please check it. 🤗 |
|
Any updates in this thread? :) |
|
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored. |
|
cc @dg845 do you want to take a look at this PR? |
| b, n, _ = z.shape | ||
| z = z.reshape(b, n, self.num_codebooks, -1) | ||
|
|
||
| with torch.autocast("cuda", enabled=False): |
There was a problem hiding this comment.
Could we remove the torch.autocast call here? We prefer explicitly managing the device placement.
| return alpha * x + (1 - alpha) * x_ | ||
|
|
||
|
|
||
| class VidTokAttnBlock(nn.Module): |
There was a problem hiding this comment.
As we never use VidTokAttnBlock block directly except through VidTokAttnBlockWrapper, could we merge these into one class that inherits from nn.Module?
dg845
left a comment
There was a problem hiding this comment.
Hi @annitang1997, sorry about the delay and thanks for your patience! I think this is close to merge, my comments are mainly about making the code style more diffusers-like.
|
Hi @annitang1997, gentle ping on this. I'd be happy to help with anything :). |
|
Hi @dg845, thank you for your detailed review😊. We have made the corresponding revisions based on your comments in commit |
|
@bot /style |
|
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update. |
|
Style bot fixed some files and pushed the changes. |
dg845
left a comment
There was a problem hiding this comment.
Thanks for the PR and thanks for your patience! Could you run make fix-copies to create dummy objects for AutoencoderVidTok? This will solve the CI failure in https://github.com/huggingface/diffusers/actions/runs/22082409825/job/63810212400?pr=11261.
We add VidTok, a versatile and state-of-the-art video tokenizer, as an autoencoder model to diffusers.
Paper: https://arxiv.org/pdf/2412.13061
Code: https://github.com/microsoft/VidTok
Model: https://huggingface.co/microsoft/VidTok