Add VidTok AutoEncoders by annitang1997 · Pull Request #11261 · huggingface/diffusers

annitang1997 · 2025-04-09T17:07:11Z

We add VidTok, a versatile and state-of-the-art video tokenizer, as an autoencoder model to diffusers.

Paper: https://arxiv.org/pdf/2412.13061
Code: https://github.com/microsoft/VidTok
Model: https://huggingface.co/microsoft/VidTok

a-r-r-o-w · 2025-04-10T06:53:04Z

Thank you for the PR @annitang1997! I will review this in depth soon. cc @yiyixuxu too

deeptimhe · 2025-04-20T09:45:44Z

Are there any updates on the review process? 👀 Looking forward to using VidTok with diffusers.

a-r-r-o-w

Thank you for the PR and congratulations on the release of your awesome work!

I did a first-pass review covering the changes needed to bring the implementation in line with the rest of the diffusers codebase. Some core implementation details will have to be refactored before we can merge. A good reference implementation for autoencoders can be found here:

I'd be happy to assist in making some of these changes! 🤗

src/diffusers/models/autoencoders/vae.py

src/diffusers/models/downsampling.py

src/diffusers/models/normalization.py

src/diffusers/models/upsampling.py

src/diffusers/models/autoencoders/autoencoder_vidtok.py

annitang1997 · 2025-05-09T16:30:52Z

Hello, I have improved the code based on your feedback. Please check it. 🤗

deeptimhe · 2025-05-23T10:17:41Z

Any updates in this thread? :)

a-r-r-o-w · 2025-05-23T19:34:07Z

@deeptimhe Sorry for the delay, I'm on leave at the moment, and so is @yiyixuxu. I'll try to test the PR and give it a look next week when I'm back

yiyixuxu

Thanks for the PR!
I left some feedback. One note on diffusers coding style: we try not to use too many small methods/functions. Ideally, all the logic is implemented in forward.

I made a few examples in the review. If you can apply similar changes throughout the implementation, it would be great :)

src/diffusers/models/autoencoders/autoencoder_vidtok.py

annitang1997 · 2025-07-02T07:09:40Z

Hello, I have cleaned the code by removing small methods/functions based on your feedback. Please check it. 🤗

annitang1997 · 2025-07-28T14:17:06Z

Any updates in this thread? :)

github-actions · 2026-01-09T15:23:16Z

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

yiyixuxu · 2026-01-09T18:26:29Z

cc @dg845 do you want to take a look at this PR?

dg845 · 2026-01-10T00:26:22Z

src/diffusers/models/autoencoders/autoencoder_vidtok.py

+        b, n, _ = z.shape
+        z = z.reshape(b, n, self.num_codebooks, -1)
+
+        with torch.autocast("cuda", enabled=False):


Could we remove the torch.autocast call here? We prefer explicitly managing the device placement.
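
One way to act on this request is to drop the `torch.autocast("cuda", enabled=False)` context and cast explicitly instead, which also avoids hard-coding a device type. The sketch below is illustrative only: `quantize_in_fp32` is a hypothetical helper, and the `tanh` plus `round` step is a stand-in assumption for the precision-sensitive math, not the code in the PR.

```python
import torch


def quantize_in_fp32(z: torch.Tensor, num_codebooks: int) -> torch.Tensor:
    """Hypothetical helper: run a precision-sensitive step in float32
    without wrapping it in a torch.autocast context."""
    b, n, _ = z.shape
    z = z.reshape(b, n, num_codebooks, -1)

    # Remember the caller's dtype, then upcast explicitly. This works on
    # whatever device `z` already lives on; no device type is hard-coded.
    orig_dtype = z.dtype
    z_fp32 = z.float()

    # Stand-in for the real quantization math (assumption): bound the
    # values to (-1, 1), then round each one to -1, 0, or 1.
    codes = torch.round(torch.tanh(z_fp32))

    # Cast back so downstream layers see the dtype they passed in.
    return codes.to(orig_dtype)
```

With a half-precision input such as `torch.randn(2, 3, 8, dtype=torch.float16)` and `num_codebooks=4`, the result keeps the `float16` dtype and has shape `(2, 3, 4, 2)`.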

dg845 · 2026-01-10T00:36:46Z

src/diffusers/models/autoencoders/autoencoder_vidtok.py

+        return alpha * x + (1 - alpha) * x_
+
+
+class VidTokAttnBlock(nn.Module):


As we never use VidTokAttnBlock directly, only through VidTokAttnBlockWrapper, could we merge the two into a single class that inherits from nn.Module?
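
A minimal sketch of what such a merge could look like, under the assumption that the wrapper only reshapes video tensors before delegating to the inner block. The thread does not show the wrapper's actual behavior, so `MergedAttnBlock`, its layers, and the frame-folding step are all hypothetical and not the PR's implementation.

```python
import torch
import torch.nn as nn


class MergedAttnBlock(nn.Module):
    """Hypothetical single-class spatial attention block for video tensors."""

    def __init__(self, in_channels: int):
        super().__init__()
        self.norm = nn.GroupNorm(num_groups=1, num_channels=in_channels)
        self.to_qkv = nn.Conv2d(in_channels, in_channels * 3, kernel_size=1)
        self.proj_out = nn.Conv2d(in_channels, in_channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, frames, height, width)
        b, c, t, h, w = x.shape

        # Work a separate wrapper might have done (assumption):
        # fold the frame axis into the batch axis.
        hidden = x.permute(0, 2, 1, 3, 4).reshape(b * t, c, h, w)

        q, k, v = self.to_qkv(self.norm(hidden)).chunk(3, dim=1)
        # (b*t, c, h, w) -> (b*t, h*w, c): attend over spatial positions.
        q, k, v = (y.flatten(2).transpose(1, 2) for y in (q, k, v))
        weights = torch.softmax(q @ k.transpose(1, 2) / c**0.5, dim=-1)
        attn = (weights @ v).transpose(1, 2).reshape(b * t, c, h, w)

        # Residual connection, then undo the fold to restore the input layout.
        hidden = hidden + self.proj_out(attn)
        return hidden.reshape(b, t, c, h, w).permute(0, 2, 1, 3, 4)
```

Keeping the reshape and the attention math in one `forward` means there is a single module to register, load weights into, and read, which is in line with the style feedback earlier in the thread.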

dg845

Hi @annitang1997, sorry about the delay and thanks for your patience! I think this is close to merge; my comments are mainly about making the code style more diffusers-like.

dg845 · 2026-01-21T07:09:54Z

Hi @annitang1997, gentle ping on this. I'd be happy to help with anything :).

annitang1997 · 2026-02-13T17:36:49Z

Hi @dg845, thank you for your detailed review😊. We have made the corresponding revisions based on your comments in commit b9a86c4. Please kindly check again.

dg845 · 2026-02-17T01:00:30Z

@bot /style

HuggingFaceDocBuilderDev · 2026-02-17T01:00:39Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

github-actions · 2026-02-17T01:00:54Z

Style bot fixed some files and pushed the changes.

dg845

Thanks for the PR and thanks for your patience! Could you run make fix-copies to create dummy objects for AutoencoderVidTok? This will solve the CI failure in https://github.com/huggingface/diffusers/actions/runs/22082409825/job/63810212400?pr=11261.

annitang1997 added 2 commits April 10, 2025 00:47

add_autoencoder_vidtok

371aa27

Merge branch 'main' into add_autoencoder_vidtok

b2dc1ef

Merge branch 'main' into add_autoencoder_vidtok

0ce6be7

a-r-r-o-w reviewed Apr 21, 2025

annitang1997 added 3 commits May 3, 2025 13:05

Merge branch 'huggingface:main' into add_autoencoder_vidtok

f0f5c58

format standardization

b4e1deb

Merge branch 'huggingface:main' into add_autoencoder_vidtok

a466717

annitang1997 added 2 commits June 11, 2025 20:28

Merge branch 'main' into add_autoencoder_vidtok

4c4c051

Merge branch 'main' into add_autoencoder_vidtok

1ad58e5

yiyixuxu reviewed Jun 17, 2025

annitang1997 added 2 commits July 2, 2025 13:40

Merge branch 'huggingface:main' into add_autoencoder_vidtok

f552028

remove small functions

3506971

github-actions bot added the "stale" label (Issues that haven't received updates) Jan 9, 2026

yiyixuxu removed the "stale" label (Issues that haven't received updates) Jan 9, 2026

yiyixuxu requested a review from dg845 January 9, 2026 18:26