This repository contains the official implementation of the paper "Delving into Latent Spectral Biasing of Video VAEs for Superior Diffusability".
Most existing video VAEs prioritize reconstruction fidelity, often overlooking the latent structure's impact on downstream diffusion training. Through statistical analysis of VAE latents, our research identifies properties of video VAE latent spaces that facilitate diffusion training. Our key finding is that biased, rather than uniform, spectra lead to improved diffusability. Motivated by this, we introduce SSVAE (Spectral-Structured VAE), which optimizes the *spectral properties* of the latent space to enhance its diffusability.
- Spectral Analysis of Latents: We identify two statistical properties essential for efficient diffusion training: a low-frequency biased spatio-temporal spectrum and a few-mode biased channel eigenspectrum.
- Local Correlation Regularization (LCR): A lightweight regularizer that explicitly enhances local spatio-temporal correlations to induce low-frequency bias (a minimal sketch follows this list).
- Latent Masked Reconstruction (LMR): A mechanism that simultaneously promotes few-mode bias and improves decoder robustness against noise.
- Superior Performance:
  - 🚀 3× Faster Convergence: Accelerates text-to-video generation convergence by 3× compared to strong baselines.
  - 📈 Higher Quality: Achieves a 10% gain in video reward scores (UnifiedReward).
  - 🏆 Outperforms SOTA: Surpasses open-source VAEs (e.g., Wan 2.2, CogVideoX) in generation quality with fewer parameters.
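For illustration, here is a minimal sketch of what an LCR-style regularizer could look like. It is not the paper's exact formulation; the latent layout `(B, C, T, H, W)`, the use of cosine similarity, and the equal weighting over the three axes are assumptions made only for this example.

```python
import torch
import torch.nn.functional as F


def local_correlation_loss(z: torch.Tensor) -> torch.Tensor:
    """Encourage each latent vector to resemble its immediate neighbors.

    z is assumed to be a video latent of shape (B, C, T, H, W). The loss is
    1 - cosine similarity between the latent and its one-step shift along
    time, height, and width, averaged over the three axes.
    """
    loss = z.new_zeros(())
    for dim in (2, 3, 4):  # temporal, vertical, horizontal neighbors
        a = z.narrow(dim, 0, z.size(dim) - 1)
        b = z.narrow(dim, 1, z.size(dim) - 1)
        sim = F.cosine_similarity(a, b, dim=1)  # similarity over channels
        loss = loss + (1.0 - sim).mean()
    return loss / 3.0
```

Such a term would typically be added to the VAE training objective with a small weight alongside the reconstruction loss.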
We use WebDataset to build the dataset. Please organize your data accordingly before training. Structure your dataset as follows:
```
data/
└── webvid/
    ├── 000000.meta.jsonl
    ├── 000000.tar
    ├── 000001.meta.jsonl
    ├── 000001.tar
    └── ...
```
- tar files: Each tar should pack multiple video samples, and each sample contains at least an `.mp4` file and an `.id` file. The `.id` file must exist, but its content is not important (see the packing sketch after the metadata example below).
- meta files: Each line is a JSON object describing the metadata of one video in the corresponding tar. Required fields include `key` (video name), `duration`, and `fps`. Example contents for `000000.meta.jsonl`:
{"key": "1000000006", "duration": 16.0, "fps": 60, ...} {"key": "1000000007", "duration": 29.5, "fps": 30, ...}
Note: Before training, update the dataset `path` under `train:` in the config files to point to your actual data directory. Multiple paths can be separated by commas:
path: ";path/to/dataset1,path/to/dataset2,..."
The default training entrypoint is provided by `scripts/train.sh`. We use 32 H100 GPUs for the first stage of training, and 8 GPUs for the second stage.

```bash
# Stage 1
bash scripts/train.sh configs/ch48_LCR_LMR_256p.yaml ch48_LCR_LMR_256p

# Stage 2 (remember to replace the "ckpt_path" field in the config with the
# checkpoint path obtained from the first stage)
bash scripts/train.sh configs/ch48_LCR_LMR_512p_DecoderFinetune.yaml ch48_LCR_LMR_512p_DecoderFinetune
```

You can download our pre-trained model from https://huggingface.co/zai-org/SSVAE. The default inference entrypoint is provided by `scripts/inference.sh`. To run reconstruction using our pretrained VAE, use:

```bash
python reconstruction.py --config configs/inference.yaml --input assets/video/0001.mp4 --output output/
```

Note: Specify the path of the downloaded pretrained model in the config:
ckpt_path: "SSVAE/ch48_256p_15w_512p_5w.ckpt" ## Replace with your actual path
Note: If you encounter an error like
```
ModuleNotFoundError: No module named 'torchvision.transforms.functional_tensor'
```

when importing `pytorchvideo`, this is caused by a compatibility issue between older versions of `pytorchvideo` (e.g., 0.1.5) and newer versions of `torchvision`, where `torchvision.transforms.functional_tensor` has been removed. Here is how to fix it:

Edit the file `venv/lib/python3.*/site-packages/pytorchvideo/transforms/augmentations.py` and replace:

```python
import torchvision.transforms.functional_tensor as F_t
```

with:

```python
from torchvision.transforms import functional as F_t
```
Then rerun the inference command.
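Alternatively, if you prefer not to edit files inside `site-packages`, the same compatibility issue can usually be worked around with a small import shim at the top of your entry script. This is a sketch of a common workaround, not part of the official codebase:

```python
import sys

import torchvision.transforms.functional as functional

# Register the removed module path so pytorchvideo's old import still resolves.
sys.modules["torchvision.transforms.functional_tensor"] = functional

import pytorchvideo  # noqa: E402  (imported after the shim on purpose)
```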
Generation training can be achieved by integrating SSVAE into an existing text-to-video training framework. For example, you can replace the `sat/sgm` directory of CogVideo with the `ssvae` directory from this repository and update the VAE inference configuration files accordingly to enable text-to-video training.
If you find this work useful in your research, please consider citing:
```bibtex
@misc{liu2025delvinglatentspectralbiasing,
  title={Delving into Latent Spectral Biasing of Video VAEs for Superior Diffusability},
  author={Shizhan Liu and Xinran Deng and Zhuoyi Yang and Jiayan Teng and Xiaotao Gu and Jie Tang},
  year={2025},
  eprint={2512.05394},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2512.05394},
}
```