Glow: Better reversible generative models


We introduce _Glow_, a reversible generative model which uses invertible 1x1 convolutions. It extends previous work on reversible generative models and simplifies the architecture. Our model can generate realistic high-resolution images, supports efficient sampling, and discovers features that can be used to manipulate attributes of data. We’re releasing code for the model and an online visualization tool so people can explore and build on these results.

Generative modeling is about observing data, like a set of pictures of faces, then learning a model of how this data was generated. Learning to approximate the data-generating process requires learning _all structure_ present in the data, and successful models should be able to synthesize outputs that look similar to the data. Accurate generative models have broad applications, including speech synthesis, text analysis and synthesis, semi-supervised learning, and model-based control. The technique we propose can be applied to those problems as well.

Glow is a type of reversible generative model, also called a _flow-based generative model_, and is an extension of the NICE and RealNVP techniques. Flow-based generative models have so far gained little attention in the research community compared to GANs and VAEs.

Some of the merits of flow-based generative models include:

- **Exact latent-variable inference and log-likelihood evaluation.** The mapping between data and latents is invertible, so latents can be inferred exactly and the exact log-likelihood of the data can be optimized, instead of a lower bound as in VAEs.
- **Efficient inference and efficient synthesis.** Both directions of the model can be parallelized, unlike autoregressive models, where synthesis is sequential.
- **A useful latent space for downstream tasks.** Datapoints can be interpolated and meaningfully modified in latent space, as demonstrated below.

Using our techniques, we achieve significant improvements on standard benchmarks compared to RealNVP, the previously best published result among flow-based generative models.

| Dataset | RealNVP | Glow |
| --- | --- | --- |
| CIFAR-10 | 3.49 | 3.35 |
| ImageNet 32x32 | 4.28 | 4.09 |
| ImageNet 64x64 | 3.98 | 3.81 |
| LSUN (bedroom) | 2.72 | 2.38 |
| LSUN (tower) | 2.81 | 2.46 |
| LSUN (church outdoor) | 3.08 | 2.67 |

Quantitative performance in bits per dimension (lower is better), evaluated on the test sets of various datasets, for the RealNVP model versus our Glow model.
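For reference, bits per dimension is the model’s negative log-likelihood converted from nats to bits and averaged over the number of pixel-channels, which makes scores comparable across image sizes. A quick sketch of the conversion:

```python
import math

def bits_per_dim(nll_nats, image_shape):
    """Convert a per-image negative log-likelihood (in nats) to bits
    per dimension, the metric reported in the table above."""
    num_dims = 1
    for d in image_shape:
        num_dims *= d
    # Divide by ln(2) to convert nats to bits, then by dimension count.
    return nll_nats / (math.log(2) * num_dims)

# A 32x32 RGB image has 3 * 32 * 32 = 3072 dimensions, so an NLL of
# ~8710 nats corresponds to roughly 4.09 bits per dimension.
bpd = bits_per_dim(8710.0, (3, 32, 32))
```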

Glow models can generate realistic-looking high-resolution images, and can do so efficiently: our model takes about 130ms to generate a 256x256 sample on an NVIDIA 1080 Ti GPU. Like previous work, we found that sampling from a reduced-temperature model often results in higher-quality samples. The samples above were obtained by scaling the standard deviation of the latents by a temperature of 0.7.
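Reduced-temperature sampling only changes the prior: instead of decoding a latent drawn from a standard normal, we scale its standard deviation by the temperature. A minimal NumPy sketch (the `m.decode` call in the comment assumes a hypothetical trained-model API):

```python
import numpy as np

def sample_latent(shape, temperature=0.7, rng=None):
    """Draw a latent from N(0, temperature^2 I): reduced-temperature
    sampling simply scales the standard deviation of the prior."""
    rng = np.random.default_rng() if rng is None else rng
    return temperature * rng.standard_normal(shape)

# x = m.decode(sample_latent((3, 256, 256)))  # `m`: trained model (hypothetical API)
```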

### Interpolation in latent space

We can also interpolate between arbitrary faces by using the encoder to obtain latents for two real images and decoding points along the line between them. Note that the inputs are arbitrary faces and not samples from the model, providing evidence that the model has support over the full target distribution.
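Concretely, interpolation is linear in latent space. A minimal sketch, assuming a hypothetical model object `m` that exposes `encode`/`decode` methods:

```python
import numpy as np

def interpolate(m, x_a, x_b, num_steps=8):
    """Interpolate between two real images by encoding both and
    decoding evenly spaced points on the line between their latents.
    `m` is assumed to expose encode/decode (hypothetical API)."""
    z_a, z_b = m.encode(x_a), m.encode(x_b)
    frames = []
    for t in np.linspace(0.0, 1.0, num_steps):
        # Linear blend in latent space, mapped back to image space.
        frames.append(m.decode((1.0 - t) * z_a + t * z_b))
    return frames
```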

### Manipulation in latent space

We can train a flow-based model without labels, then use the learned latent representation for downstream tasks like manipulating attributes of an input. These semantic attributes could be the color of hair in a face, the style of an image, the pitch of a musical sound, or the emotion of a sentence. Since flow-based models have a perfect encoder, we can encode inputs and compute the average latent vector of inputs with and without the attribute. The difference between the two average vectors can then be used to move an arbitrary input toward or away from that attribute.

The above process requires a relatively small amount of labeled data, and can be done after the model has been trained (no labels are needed while training). Previous work using GANs requires training an encoder separately. Approaches using VAEs only guarantee that the decoder and encoder are compatible for in-distribution data. Other approaches, like CycleGAN, directly learn the function representing the transformation, but they require retraining for every transformation.

```python
# Train flow model on large, unlabelled dataset X
m = train(X_unlabelled)

# Split labelled dataset based on attribute, say blonde hair
X_positive, X_negative = split(X_labelled)

# Obtain average encodings of positive and negative inputs
z_positive = average([m.encode(x) for x in X_positive])
z_negative = average([m.encode(x) for x in X_negative])

# Get manipulation vector by taking difference
z_manipulate = z_positive - z_negative

# Manipulate new x_input along z_manipulate, by a scalar alpha \in [-1,1]
z_input = m.encode(x_input)
x_manipulated = m.decode(z_input + alpha * z_manipulate)
```

Simple code snippet for using a flow-based model for manipulating attributes

Our main contribution, and also our departure from the earlier RealNVP work, is the addition of a reversible 1x1 convolution, as well as the removal of other components, simplifying the architecture overall.

The RealNVP architecture consists of sequences of two types of layers: layers with checkerboard masking and layers with channel-wise masking. We remove the layers with checkerboard masking, simplifying the architecture. The layers with channel-wise masking perform the equivalent of a repetition of the following steps:

1. Permute the inputs by reversing their ordering across the channel dimension.
2. Split the input into two parts, A and B, down the middle of the feature dimension.
3. Feed A into a shallow convolutional neural network. Linearly transform B according to the output of the neural network.
4. Concatenate A and B.
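The four steps above can be sketched as a single invertible coupling step. This is a minimal 1-D NumPy illustration, not the paper’s implementation: a toy function stands in for the shallow convolutional network, producing a log-scale and shift that linearly transform B.

```python
import numpy as np

def coupling_forward(x, net):
    """One channel-wise coupling step: reverse channels (step 1), split
    into halves A and B (step 2), let a small network of A alone produce
    a scale and shift that linearly transform B (step 3), concatenate (step 4)."""
    x = x[::-1]                        # step 1: reverse channel ordering
    a, b = np.split(x, 2)              # step 2: split down the middle
    log_scale, shift = net(a)          # step 3: the net only ever sees A...
    b = b * np.exp(log_scale) + shift  # ...so transforming B stays invertible
    return np.concatenate([a, b])      # step 4: concatenate

def coupling_inverse(y, net):
    """Exact inverse: A is unchanged, so we can recompute the same
    scale/shift, undo the transform of B, then undo the reversal."""
    a, b = np.split(y, 2)
    log_scale, shift = net(a)
    b = (b - shift) * np.exp(-log_scale)
    return np.concatenate([a, b])[::-1]
```

Because the network only reads A, invertibility holds no matter how complex the network is; chaining such steps alternates which half gets updated.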

By chaining these layers, A updates B, then B updates A, and so on. This bipartite flow of information is clearly quite rigid. We found that model performance improves by changing the reverse permutation of step (1) to a (fixed) _shuffling_ permutation.

Taking this a step further, we can also _learn_ the optimal permutation. Learning a permutation matrix is a discrete optimization that is not amenable to gradient ascent. But because a permutation is just a special case of a linear transformation with a square matrix, we can make this work within convolutional neural networks: permuting the channels is equivalent to a 1x1 convolution with an equal number of input and output channels. So we replace the fixed permutation with learned 1x1 convolution operations, initializing the weights as a random rotation matrix. As we show in the figure below, this operation leads to significant modeling improvements. We also show that the computations involved in optimizing the objective function can be done efficiently through an LU decomposition of the weights.

Our main contribution, invertible 1x1 convolutions, leads to significant modeling improvements.
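To make the idea concrete: over C channels, a 1x1 convolution multiplies every spatial position by the same CxC matrix W, so a channel permutation is just the special case where W is a permutation matrix. A minimal NumPy sketch of the operation, the random-rotation initialization, and the per-pixel log-determinant term that enters the objective (computed here via `slogdet`; the paper’s LU parameterization is an efficiency refinement):

```python
import numpy as np

def random_rotation(c, rng):
    """Initialize the 1x1-conv weights as a random rotation matrix
    (orthogonal with determinant +1, so the initial log-det is zero)."""
    q, _ = np.linalg.qr(rng.standard_normal((c, c)))
    if np.linalg.det(q) < 0:
        q[:, 0] = -q[:, 0]  # flip one column to force det = +1
    return q

def conv1x1(x, w):
    """Apply a 1x1 convolution to x of shape (C, H, W): every spatial
    position is multiplied by the same CxC matrix -- a learned,
    invertible generalization of a channel permutation."""
    c, h, wd = x.shape
    y = w @ x.reshape(c, h * wd)
    return y.reshape(c, h, wd)

def log_det_per_pixel(w):
    """Contribution of this layer to the log-likelihood objective,
    per spatial position."""
    sign, logabsdet = np.linalg.slogdet(w)
    return logabsdet
```

Inverting the layer is just applying the inverse matrix, which is why the whole flow stays reversible.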

In addition, we remove batch normalization and replace it with an activation normalization (actnorm) layer. This layer simply shifts and scales the activations, with data-dependent initialization that normalizes the activations given an initial minibatch of data. This allows scaling down the minibatch size to 1 (for large images) and scaling up the size of the model.
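A minimal sketch of how such a data-dependent initialization could look (NumPy, NCHW shapes assumed; a simplification of the actual layer): the scale and bias are chosen so the first minibatch comes out normalized per channel, and are thereafter trained like ordinary parameters.

```python
import numpy as np

def actnorm_init(batch, eps=1e-6):
    """Data-dependent initialization: choose per-channel scale and bias
    so that (batch + bias) * scale has zero mean and unit variance on
    an initial minibatch of shape (N, C, H, W)."""
    mean = batch.mean(axis=(0, 2, 3), keepdims=True)
    std = batch.std(axis=(0, 2, 3), keepdims=True)
    bias = -mean
    scale = 1.0 / (std + eps)
    return scale, bias

def actnorm(x, scale, bias):
    """The layer itself is just an affine transform per channel; after
    initialization, scale and bias are learned like any other parameters,
    with no dependence on minibatch statistics (unlike batch norm)."""
    return (x + bias) * scale
```

Because the forward pass never touches minibatch statistics after initialization, the layer behaves identically at minibatch size 1.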

Our architecture, combined with various optimizations such as gradient checkpointing, allows us to train flow-based generative models on a larger scale than usual. We used Horovod to easily train our model on a cluster of multiple machines; the model used in our demo was trained on 5 machines with 8 GPUs each. Using this setup we trained models with over a hundred million parameters.

## Directions for research

Our work suggests that it’s possible to train flow-based models to generate realistic high-resolution images, and that the learned latent representations can be easily used for downstream tasks like manipulation of data. We suggest a few directions for future work:

1. Be competitive with other model classes on likelihood. Autoregressive models and VAEs perform better than flow-based models on log-likelihood, but they have the drawbacks of inefficient sampling and inexact inference, respectively. Combining flow-based models, VAEs, and autoregressive models to trade off their strengths would be an interesting direction for future work.
2. Improve the architecture to be more compute- and parameter-efficient. To generate realistic high-resolution images, the face generation model uses ~200M parameters and ~600 convolution layers, which makes it expensive to train. Models with smaller depth performed worse at learning long-range dependencies. Using self-attention architectures, or performing progressive training to scale to high resolutions, could make it computationally cheaper to train Glow models.

Finally, if you’d like to use Glow in your research, we encourage you to check out our paper for more details, or look at our code in this GitHub repo.

Prafulla Dhariwal, Durk Kingma

Special thanks to Nicholas Benson for helping us build the demo.


Originally published on OpenAI News.