Excellent research work! Could you please provide the code for training the video tokenizer? It is essential for reproducing the entire pipeline.