SimFlow: Simplified and End-to-End Training of Latent Normalizing Flows

We propose a simple end-to-end framework to train a VAE and an NF jointly from scratch, outperforming the prior NF-based model STARFlow, which uses a frozen VAE encoder.

Qinyu Zhao1,2 · Guangting Zheng2 · Tao Yang2 · Rui Zhu2† · Xingjian Leng1 · Stephen Gould1 · Liang Zheng1

1 Australian National University   2 ByteDance Seed   † Project Lead

Background

Normalizing Flows (NFs) learn invertible mappings between the data distribution and a Gaussian distribution. Prior latent NF works usually suffer from two limitations: they rely on a frozen, pretrained VAE encoder whose latent space is not adapted to the NF, and they need extra noise schedules or denoising steps at sampling time to cope with a non-smooth latent space.
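Concretely, an NF learns an invertible map $f$ that sends a data sample $x$ to Gaussian noise $z = f(x)$, so the exact log-likelihood follows from the change-of-variables formula:

$$\log p_X(x) = \log p_Z\big(f(x)\big) + \log \left|\det \frac{\partial f(x)}{\partial x}\right|,$$

where $p_Z$ is a standard Gaussian; the NF is trained by maximizing this likelihood.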

Method: SimFlow

We propose SimFlow to improve the performance of latent NFs. Our key design is to fix the posterior variance of the VAE encoder, which a standard VAE would predict per sample, to a constant (e.g., $\bar{\sigma}^2 = 0.5$).
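As a minimal PyTorch-style sketch (the module and argument names here are hypothetical, and the actual architecture in the paper may differ), fixing the variance amounts to replacing the encoder's predicted log-variance with a constant in the reparameterization step:

```python
import torch
import torch.nn as nn

class FixedVarianceEncoder(nn.Module):
    """Hypothetical VAE encoder with a fixed posterior variance.

    A standard VAE encoder predicts both the mean and the (log-)variance
    of q(z|x); here only the mean is predicted and the variance is a
    constant hyperparameter (e.g., 0.5), following the SimFlow design.
    """

    def __init__(self, backbone: nn.Module, fixed_var: float = 0.5):
        super().__init__()
        self.backbone = backbone            # maps images to latent means
        self.fixed_std = fixed_var ** 0.5   # constant sigma, never learned

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        mu = self.backbone(x)
        # Reparameterization with a constant sigma: z = mu + sigma * eps.
        eps = torch.randn_like(mu)
        return mu + self.fixed_std * eps
```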

Method
Comparison of our framework with closely related methods. Solid arrows indicate the forward pass, while dashed arrows denote gradient flows. We label frozen modules in gray, generative models in green, and VAE modules involved in training in red.

Key Insights

1. VAEs with fixed variances are more robust to latent perturbation.

VAE Robustness
Robustness of VAEs with fixed variances. (a) A VAE with a large and fixed variance can maintain reconstruction quality under latent noise, while the performance of a VAE with learnable variance degrades significantly. (b) For VAEs with a large variance, the images reconstructed from linearly interpolated latents still clearly show the main subjects (the cat or the dog), rather than blending them. 'Learnable' indicates a standard VAE with learnable variance, while '$\bar{\sigma}^2=x^2$' denotes a VAE with a fixed variance of $x^2$.
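One rough way to probe this robustness (illustrative only: `vae.encode` and `vae.decode` are assumed interfaces, and PSNR on [0, 1] images is a stand-in for the paper's evaluation protocol):

```python
import torch

@torch.no_grad()
def perturbation_psnr(vae, x, noise_std=0.5):
    """Reconstruction PSNR after adding Gaussian noise to the latents.

    Assumes `vae.encode` maps images to latents and `vae.decode` maps
    latents back to images in [0, 1]; both interfaces are placeholders.
    """
    z = vae.encode(x)
    z_noisy = z + noise_std * torch.randn_like(z)
    x_rec = vae.decode(z_noisy).clamp(0.0, 1.0)
    mse = torch.mean((x - x_rec) ** 2)
    return 10.0 * torch.log10(1.0 / mse)  # PSNR for images in [0, 1]

@torch.no_grad()
def decode_interpolation(vae, x0, x1, alpha=0.5):
    """Decode a linear interpolation of two latents (e.g., a cat and a dog)."""
    z = (1.0 - alpha) * vae.encode(x0) + alpha * vae.encode(x1)
    return vae.decode(z)
```

A robust VAE keeps `perturbation_psnr` high as `noise_std` grows, and its interpolated decodes show one clear subject rather than a blend.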

2. End-to-end training makes the latent space potentially more suitable for generative modeling.

VAE Latent Space
End-to-end training reshapes the latent space, making it potentially more suitable for training generative models.
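A hypothetical sketch of one joint training step under these assumptions (the `vae`/`flow` interfaces, the loss weight, and the plain MSE reconstruction term are illustrative, not the paper's exact objective):

```python
import torch

def joint_training_step(vae, flow, optimizer, x, fixed_var=0.5, nll_weight=1.0):
    """One end-to-end step: the NF's negative log-likelihood on the latents
    back-propagates into the VAE encoder, unlike frozen-encoder setups.

    `vae.encode_mean`, `vae.decode`, and `flow.log_prob` are assumed
    interfaces; a KL/regularization term on the means may also be added.
    """
    optimizer.zero_grad()

    mu = vae.encode_mean(x)
    z = mu + (fixed_var ** 0.5) * torch.randn_like(mu)  # fixed-variance posterior

    recon_loss = torch.mean((vae.decode(z) - x) ** 2)  # reconstruction term
    nll = -flow.log_prob(z).mean()                     # NF likelihood on latents

    loss = recon_loss + nll_weight * nll
    loss.backward()  # gradients reach both the flow and the VAE
    optimizer.step()
    return loss.item()
```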

3. Studies of design variants are presented in the paper.

Experimental Results

On ImageNet class-conditional generation, our method SimFlow establishes a new state of the art, significantly outperforming NF-based baselines such as STARFlow.

ImageNet $256\times 256$

Results 256x256
Comparison on ImageNet $256\times 256$. SimFlow (with or without REPA-E) achieves better performance than STARFlow on ImageNet $256\times 256$.

ImageNet $512\times 512$

Results 512x512
Comparison on ImageNet $512\times 512$. SimFlow with REPA-E achieves better performance than STARFlow on ImageNet $512\times 512$.

Qualitative Results

Examples
Selected samples. More uncurated samples can be found in the paper.

Conclusion

This paper presents SimFlow, an end-to-end training framework for latent NFs built on a simple idea: fixing the VAE posterior variance. This makes the latent space smoother and helps the NF generalize better at sampling time, without extra noise schedules or denoising steps. Experiments show that SimFlow improves generation quality and speeds up training compared with existing NF methods. Future work will extend this framework to text-to-image training and explore a second training stage with the VAE frozen after joint training.

Appendix

We thank the REPA, REPA-E, and FreeFlow projects for the website template.

@article{zhao2025simflow,
  title={SimFlow: Simplified and End-to-End Training of Latent Normalizing Flows},
  author={Zhao, Qinyu and Zheng, Guangting and Yang, Tao and Zhu, Rui
    and Leng, Xingjian and Gould, Stephen and Zheng, Liang},
  year={2025},
  journal={arXiv preprint},
}