We propose a simple end-to-end framework to train a VAE and an NF jointly from scratch, outperforming the prior NF-based model STARFlow, which uses a frozen VAE encoder.
Qinyu Zhao1,2 · Guangting Zheng2 · Tao Yang2 · Rui Zhu2† · Xingjian Leng1 · Stephen Gould1 · Liang Zheng1
1 Australian National University 2 ByteDance Seed
†Project Lead
Normalizing Flows (NFs) learn invertible mappings between the data and a Gaussian distribution. Prior latent-space NFs usually suffer from two limitations: they rely on a frozen, pretrained VAE encoder whose latent space is not shaped for the flow, and they need extra noise schedules or denoising steps to generalize well when sampling.
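As background (standard NF training, not a contribution of this paper), the flow f_θ is fit by exact maximum likelihood through the change-of-variables formula, where z denotes the latent code:

\log p_\theta(z) = \log \mathcal{N}\!\big(f_\theta(z);\, 0, I\big) + \log \left|\det \frac{\partial f_\theta(z)}{\partial z}\right|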
We propose SimFlow to improve the performance of NFs. Our key design is to fix the variance (which would otherwise be predicted by the VAE encoder) to a constant (e.g., 0.5).
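Below is a minimal PyTorch sketch of this design. The module and names are our own illustration rather than the released code, and we assume 0.5 is the fixed variance (the paper's exact parameterization may differ):

import torch
import torch.nn as nn

SIGMA2 = 0.5  # assumed fixed latent variance; replaces the encoder-predicted one

class FixedVarEncoder(nn.Module):
    """Toy encoder that predicts only the latent mean; no variance head."""
    def __init__(self, in_dim: int, latent_dim: int):
        super().__init__()
        self.mean_head = nn.Linear(in_dim, latent_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        mu = self.mean_head(x)
        eps = torch.randn_like(mu)
        # Reparameterization with constant variance: z = mu + sqrt(sigma^2) * eps
        return mu + (SIGMA2 ** 0.5) * eps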
1. VAEs with a fixed variance are more robust to latent perturbations.
2. End-to-end training can shape a latent space that is better suited to generative modeling.
3. Ablation studies of design variants.
On ImageNet class-conditional generation, SimFlow establishes a new state of the art, significantly outperforming NF-based baselines.
This paper presents SimFlow, an end-to-end training framework for latent NFs that simply fixes the VAE variance. This makes the latent space smoother and helps NFs generalize better at sampling time, without extra noise schedules or denoising steps. Experiments show that SimFlow improves generation quality and speeds up training compared with existing NF methods. Future work will extend the framework to text-to-image training and explore a second training stage in which the VAE is frozen after joint training.
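To make the end-to-end recipe concrete, here is a hypothetical training step under our own assumptions: encoder, decoder, and flow are placeholder modules, flow(z) is assumed to return the Gaussianized latent and the log-determinant, and the equal loss weighting is illustrative only:

import torch
import torch.nn.functional as F

def joint_step(encoder, decoder, flow, x, optimizer, sigma2=0.5):
    # Sample a latent with the fixed-variance posterior.
    mu = encoder(x)
    z = mu + (sigma2 ** 0.5) * torch.randn_like(mu)
    # VAE branch: reconstruct the input from the perturbed latent.
    recon_loss = F.mse_loss(decoder(z), x)
    # NF branch: minimize -log p(z) = 0.5 * ||f(z)||^2 - log|det J| (constants dropped).
    u, log_det = flow(z)
    nll = (0.5 * u.flatten(1).pow(2).sum(dim=1) - log_det).mean()
    loss = recon_loss + nll  # relative weighting is an assumption
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()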
We thank the REPA, REPA-E, and FreeFlow projects for the website template.
@article{zhao2025simflow,
  title={SimFlow: Simplified and End-to-End Training of Latent Normalizing Flows},
  author={Zhao, Qinyu and Zheng, Guangting and Yang, Tao and Zhu, Rui and Leng, Xingjian and Gould, Stephen and Zheng, Liang},
  year={2025},
  journal={arXiv preprint},
}