Add VidTok AutoEncoders #11261

Open
annitang1997 wants to merge 17 commits into huggingface:main from annitang1997:add_autoencoder_vidtok
Conversation

@annitang1997

We add VidTok, a versatile and state-of-the-art video tokenizer, as an autoencoder model to diffusers.

Paper: https://arxiv.org/pdf/2412.13061
Code: https://github.com/microsoft/VidTok
Model: https://huggingface.co/microsoft/VidTok

@a-r-r-o-w
Contributor

Thank you for the PR @annitang1997! I will review this in depth soon. cc @yiyixuxu too

@deeptimhe

deeptimhe commented Apr 20, 2025

Are there any updates on the review process? 👀 Looking forward to using VidTok with diffusers.

Contributor

@a-r-r-o-w a-r-r-o-w left a comment

Thank you for the PR and congratulations on the release of your awesome work!

I did a first-pass review covering some changes needed to make the implementation consistent with the rest of the diffusers codebase. Some core implementation details will have to be refactored before we can merge. A good reference implementation for autoencoders can be found here:

I'd be happy to assist in making some of these changes! 🤗

@annitang1997
Author

annitang1997 commented May 9, 2025

Hello, I have improved the code based on your feedback. Please check it. 🤗

@deeptimhe

Any updates in this thread? :)

@a-r-r-o-w
Contributor

@deeptimhe Sorry for the delay. I'm on leave at the moment, and so is @yiyixuxu. I'll try to test the PR and take a look next week when I'm back.

Collaborator

@yiyixuxu yiyixuxu left a comment

Thanks for the PR!
I left some feedback. One note on diffusers coding style: we try not to use too many small methods/functions. Ideally, all of the logic is implemented in forward.

I made a few examples in the review; if you could apply similar changes throughout the implementation, that would be great :)

@annitang1997
Author

Hello, I have cleaned the code by removing small methods/functions based on your feedback. Please check it. 🤗

@annitang1997
Author

Any updates in this thread? :)

@github-actions
Contributor

github-actions bot commented Jan 9, 2026

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@github-actions github-actions bot added the stale Issues that haven't received updates label Jan 9, 2026
@yiyixuxu yiyixuxu removed the stale Issues that haven't received updates label Jan 9, 2026
@yiyixuxu yiyixuxu requested a review from dg845 January 9, 2026 18:26
@yiyixuxu
Collaborator

yiyixuxu commented Jan 9, 2026

cc @dg845 do you want to take a look at this PR?

b, n, _ = z.shape
z = z.reshape(b, n, self.num_codebooks, -1)

with torch.autocast("cuda", enabled=False):
Collaborator

Could we remove the torch.autocast call here? We prefer explicitly managing the device placement.
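
For illustration, one way to drop the autocast context is to cast the tensor explicitly before the precision-sensitive step and cast the result back afterwards. The sketch below assumes the autocast block existed to keep the quantization math in float32; the function name `quantize_in_float32` and the tanh/round placeholder are hypothetical and are not taken from the PR.

```python
import torch


def quantize_in_float32(z: torch.Tensor, num_codebooks: int) -> torch.Tensor:
    """Hypothetical helper: run a precision-sensitive step in float32 without autocast."""
    orig_dtype = z.dtype
    b, n, _ = z.shape
    # Split the last dimension into `num_codebooks` groups, as in the snippet under review.
    z = z.reshape(b, n, num_codebooks, -1)
    # Explicit upcast replaces `torch.autocast("cuda", enabled=False)`.
    z = z.float()
    # Placeholder for the real quantization math (an assumption, not the PR's code).
    codes = torch.round(torch.tanh(z))
    # Return to the caller's dtype so downstream layers see what they expect.
    return codes.to(orig_dtype)
```

Because the cast is explicit, the same code path runs unchanged on CPU and on accelerators other than CUDA.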

return alpha * x + (1 - alpha) * x_


class VidTokAttnBlock(nn.Module):
Collaborator

As we never use VidTokAttnBlock directly except through VidTokAttnBlockWrapper, could we merge these into one class that inherits from nn.Module?
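
A rough sketch of what such a merge could look like is given here. It assumes the wrapper's only job is to fold the frame axis of a 5D video tensor into the batch axis before applying spatial self-attention; that assumption, the class name `MergedAttnBlock`, and the layer layout are hypothetical and do not reproduce the PR's code. All of the logic sits in forward, in line with the style note earlier in the thread.

```python
import torch
import torch.nn as nn


class MergedAttnBlock(nn.Module):
    """Hypothetical single-class attention block (illustrative only)."""

    def __init__(self, channels: int):
        super().__init__()
        self.norm = nn.GroupNorm(num_groups=1, num_channels=channels)
        self.to_qkv = nn.Conv2d(channels, channels * 3, kernel_size=1)
        self.proj_out = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, frames, height, width)
        b, c, t, h, w = x.shape
        # Fold frames into the batch axis (the step a separate wrapper might have done).
        hidden = x.permute(0, 2, 1, 3, 4).reshape(b * t, c, h, w)
        q, k, v = self.to_qkv(self.norm(hidden)).chunk(3, dim=1)
        # (b*t, c, h, w) -> (b*t, h*w, c) so that each pixel is one token.
        q, k, v = (y.flatten(2).transpose(1, 2) for y in (q, k, v))
        weights = torch.softmax(q @ k.transpose(1, 2) / (c ** 0.5), dim=-1)
        attn = (weights @ v).transpose(1, 2).reshape(b * t, c, h, w)
        # Residual connection, then restore the original 5D layout.
        hidden = hidden + self.proj_out(attn)
        return hidden.reshape(b, t, c, h, w).permute(0, 2, 1, 3, 4)
```

With a single class, the reshaping and the attention math can be read top to bottom in one place, and there is no second module whose state-dict keys need to be kept in sync.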

Collaborator

@dg845 dg845 left a comment

Hi @annitang1997, sorry about the delay and thanks for your patience! I think this is close to merge; my comments are mainly about making the code style more diffusers-like.

@dg845
Collaborator

dg845 commented Jan 21, 2026

Hi @annitang1997, gentle ping on this. I'd be happy to help with anything :).

@annitang1997
Author

annitang1997 commented Feb 13, 2026

Hi @dg845, thank you for your detailed review😊. We have made the corresponding revisions based on your comments in commit b9a86c4. Please kindly check again.

@dg845
Collaborator

dg845 commented Feb 17, 2026

@bot /style

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@github-actions
Contributor

github-actions bot commented Feb 17, 2026

Style bot fixed some files and pushed the changes.

Collaborator

@dg845 dg845 left a comment

Thanks for the PR and thanks for your patience! Could you run make fix-copies to create dummy objects for AutoencoderVidTok? This will solve the CI failure in https://github.com/huggingface/diffusers/actions/runs/22082409825/job/63810212400?pr=11261.
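
For readers unfamiliar with the term, a dummy object is a placeholder class that stands in for a model when an optional backend is missing and raises a clear error on use. The following is a simplified, standalone sketch of that idea only; `requires_backend` and `PlaceholderAutoencoder` are hypothetical names, and the files generated by make fix-copies will look different.

```python
import importlib.util


def requires_backend(obj, backend: str) -> None:
    """Hypothetical check: raise if `backend` cannot be imported."""
    name = getattr(obj, "__name__", obj.__class__.__name__)
    if importlib.util.find_spec(backend) is None:
        raise ImportError(f"{name} requires the {backend} library, which was not found.")


class PlaceholderAutoencoder:
    """Hypothetical stand-in exposed when the real class cannot be imported."""

    # A deliberately missing module name so that the sketch always takes the error path.
    _backend = "a_backend_that_is_not_installed"

    def __init__(self, *args, **kwargs):
        requires_backend(self, self._backend)

    @classmethod
    def from_pretrained(cls, *args, **kwargs):
        requires_backend(cls, cls._backend)
```

The benefit of this pattern is that importing the package still succeeds without the backend, and the user only sees an error, with an actionable message, when they try to instantiate or load the model.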

7 participants

Comments