We propose a simple end-to-end framework to train a VAE and an NF jointly from scratch, outperforming the prior NF-based model STARFlow, which uses a frozen VAE encoder.
Qinyu Zhao1,2 · Guangting Zheng2 · Tao Yang2 · Rui Zhu2† · Xingjian Leng1 · Stephen Gould1 · Liang Zheng1
1 Australian National University 2 ByteDance Seed
†Project Lead
Normalizing Flows (NFs) learn invertible mappings between the data and a Gaussian distribution. Prior latent-space NFs usually suffer from two limitations: they rely on a frozen, pretrained VAE encoder whose latent space is not shaped for the flow, and they need extra noise schedules or denoising steps to generalize well when sampling.
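As background (standard NF training, not a contribution of this paper), the flow f_θ is fit by exact maximum likelihood through the change-of-variables formula, where z denotes the latent code:

\log p_\theta(z) = \log \mathcal{N}\!\big(f_\theta(z);\, 0, I\big) + \log \left|\det \frac{\partial f_\theta(z)}{\partial z}\right|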
We propose SimFlow to improve the performance of NFs. Our key design is to fix the variance (which would otherwise be predicted by the VAE encoder) to a constant (e.g., 0.5).
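Below is a minimal PyTorch sketch of this design. The module and names are our own illustration rather than the released code, and we assume 0.5 is the fixed variance (the paper's exact parameterization may differ):

import torch
import torch.nn as nn

SIGMA2 = 0.5  # assumed fixed latent variance; replaces the encoder-predicted one

class FixedVarEncoder(nn.Module):
    """Toy encoder that predicts only the latent mean; no variance head."""
    def __init__(self, in_dim: int, latent_dim: int):
        super().__init__()
        self.mean_head = nn.Linear(in_dim, latent_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        mu = self.mean_head(x)
        eps = torch.randn_like(mu)
        # Reparameterization with constant variance: z = mu + sqrt(sigma^2) * eps
        return mu + (SIGMA2 ** 0.5) * eps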
1. VAEs with a fixed variance are more robust to latent perturbations.
2. End-to-end training can shape a latent space that is better suited to generative modeling.
3. Ablation studies of design variants.
On ImageNet class-conditional generation, SimFlow establishes a new state of the art, significantly outperforming NF-based baselines.
This paper presents SimFlow, an end-to-end training framework for latent NFs that simply fixes the VAE variance. This makes the latent space smoother and helps NFs generalize better at sampling time, without extra noise schedules or denoising steps. Experiments show that SimFlow improves generation quality and speeds up training compared with existing NF methods. Future work will extend the framework to text-to-image training and explore a second training stage in which the VAE is frozen after joint training.
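To make the end-to-end recipe concrete, here is a hypothetical training step under our own assumptions: encoder, decoder, and flow are placeholder modules, flow(z) is assumed to return the Gaussianized latent and the log-determinant, and the equal loss weighting is illustrative only:

import torch
import torch.nn.functional as F

def joint_step(encoder, decoder, flow, x, optimizer, sigma2=0.5):
    # Sample a latent with the fixed-variance posterior.
    mu = encoder(x)
    z = mu + (sigma2 ** 0.5) * torch.randn_like(mu)
    # VAE branch: reconstruct the input from the perturbed latent.
    recon_loss = F.mse_loss(decoder(z), x)
    # NF branch: minimize -log p(z) = 0.5 * ||f(z)||^2 - log|det J| (constants dropped).
    u, log_det = flow(z)
    nll = (0.5 * u.flatten(1).pow(2).sum(dim=1) - log_det).mean()
    loss = recon_loss + nll  # relative weighting is an assumption
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()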
We thank the REPA, REPA-E, and FreeFlow projects for the website template.
@article{zhao2025simflow,
  title={SimFlow: Simplified and End-to-End Training of Latent Normalizing Flows},
  author={Zhao, Qinyu and Zheng, Guangting and Yang, Tao and Zhu, Rui and Leng, Xingjian and Gould, Stephen and Zheng, Liang},
  year={2025},
  journal={arXiv preprint},
}