There is a veritable zoo of GAN variants. Some of the most prominent are as follows:
Conditional GAN
Conditional GANs are similar to standard GANs except they allow the model to conditionally generate samples based on additional information. For example, if we want to generate a cat face given a dog picture, we could use a conditional GAN. The generator in a GAN game generates \mu_G, a probability distribution on the probability space \Omega. This leads to the idea of a conditional GAN, where instead of generating one probability distribution on \Omega, the generator generates a different probability distribution \mu_G(c) on \Omega, for each given class label c. For example, for generating images that look like
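ImageNet, the generator should be able to generate a picture of a cat when given the class label "cat". In the original paper, the authors noted that a GAN can be trivially extended to a conditional GAN by providing the labels to both the generator and the discriminator.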
GANs with alternative architectures
The GAN game is a general framework and can be run with any reasonable parametrization of the generator G and discriminator D. In the original paper, the authors demonstrated it using multilayer perceptron networks and convolutional neural networks. Many alternative architectures have been tried.
Deep convolutional GAN (DCGAN): For both generator and discriminator, uses only deep networks consisting entirely of convolution-deconvolution layers, that is, fully convolutional networks.
Self-attention GAN (SAGAN): Starts with the DCGAN, then adds residually-connected standard self-attention modules to the generator and discriminator.
Variational autoencoder GAN (VAEGAN): Uses a variational autoencoder (VAE) for the generator.
Transformer GAN (TransGAN): Uses a pure transformer architecture for both the generator and discriminator, entirely devoid of convolution-deconvolution layers.
Flow-GAN: Uses a flow-based generative model for the generator, allowing efficient computation of the likelihood function.
GANs with alternative objectives
Many GAN variants are obtained merely by changing the loss functions for the generator and discriminator.
Original GAN: We recast the original GAN objective into a form more convenient for comparison, with D(x) read as the discriminator's estimated probability that x comes from the reference distribution:
\begin{cases} \min_D L_D(D, \mu_G) = -\operatorname E_{x\sim \mu_{\text{ref}}}[\ln D(x)] - \operatorname E_{x\sim \mu_G}[\ln (1-D(x))]\\ \min_G L_G(D, \mu_G) = \operatorname E_{x\sim \mu_G}[\ln (1-D(x))] \end{cases}
Original GAN, non-saturating loss: L_G = -\operatorname E_{x\sim \mu_G}[\ln D(x)]
This objective for the generator was recommended in the original paper for faster convergence.
Original GAN, maximum likelihood: L_G = -\operatorname E_{x\sim \mu_G}[({\exp} \circ \sigma^{-1} \circ D) (x)]
where \sigma is the logistic function. When the discriminator is optimal, the generator gradient is the same as in maximum likelihood estimation, even though GAN cannot perform maximum likelihood estimation itself.
Hinge loss GAN:
L_D = -\operatorname E_{x\sim \mu_{\text{ref}}}\left[\min\left(0, -1 + D(x)\right)\right] -\operatorname E_{x\sim\mu_G}\left[\min\left(0, -1 - D\left(x\right)\right)\right]
L_G = -\operatorname E_{x\sim \mu_G} [D(x)]
Least squares GAN:
L_D = \operatorname E_{x\sim \mu_{\text{ref}}}[(D(x)-b)^2] + \operatorname E_{x\sim \mu_G}[(D(x)-a)^2]
L_G = \operatorname E_{x\sim \mu_G}[(D(x)-c)^2]
where a, b, c are parameters to be chosen. The authors recommended a = -1, b = 1, c = 0.
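As a rough illustration, the hinge and least-squares objectives above can be evaluated on a small batch of raw discriminator scores. The sketch below is plain Python; the function names and the example scores are hypothetical, and expectations are replaced by batch averages.

```python
def mean(values):
    """Average of an iterable of numbers (stands in for the expectation E)."""
    values = list(values)
    return sum(values) / len(values)


def hinge_losses(scores_real, scores_fake):
    """Return (L_D, L_G) for the hinge loss GAN.

    scores_real: D(x) for x drawn from the reference distribution.
    scores_fake: D(x) for x drawn from the generator.
    """
    loss_d = (-mean(min(0.0, -1.0 + s) for s in scores_real)
              - mean(min(0.0, -1.0 - s) for s in scores_fake))
    loss_g = -mean(scores_fake)
    return loss_d, loss_g


def least_squares_losses(scores_real, scores_fake, a=-1.0, b=1.0, c=0.0):
    """Return (L_D, L_G) for the least squares GAN with targets a, b, c."""
    loss_d = (mean((s - b) ** 2 for s in scores_real)
              + mean((s - a) ** 2 for s in scores_fake))
    loss_g = mean((s - c) ** 2 for s in scores_fake)
    return loss_d, loss_g


# Hypothetical scores: the discriminator rates real samples high, fakes low.
real_scores = [2.0, 0.5]
fake_scores = [-2.0, 0.0]
print(hinge_losses(real_scores, fake_scores))
print(least_squares_losses(real_scores, fake_scores))
```

In the hinge loss, a real sample scored above 1 or a generated sample scored below -1 contributes nothing to L_D, so only samples inside the margin drive the discriminator's updates.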
Wasserstein GAN (WGAN)
The Wasserstein GAN modifies the GAN game at two points:
• The discriminator's strategy set is the set of measurable functions of type D: \Omega \to \R with bounded Lipschitz norm: \|D\|_L \leq K, where K is a fixed positive constant.
• The objective is L_{WGAN}(\mu_G, D) := \operatorname E_{x\sim \mu_G}[D(x)] - \operatorname E_{x\sim \mu_{\text{ref}}}[D(x)]
One of its purposes is to solve the problem of mode collapse (see above).
Adversarial autoencoder (AAE)
The adversarial autoencoder (AAE) is more autoencoder than GAN. The idea is to start with a plain autoencoder, but train a discriminator to discriminate the latent vectors from a reference distribution (often the normal distribution).
InfoGAN
In conditional GAN, the generator receives both a noise vector z and a label c, and produces an image G(z, c). The discriminator receives image-label pairs (x, c), and computes D(x, c). When the training dataset is unlabeled, conditional GAN does not work directly.
The idea of InfoGAN is to decree that every latent vector in the latent space can be decomposed as (z, c): an incompressible noise part z, and an informative label part c, and to encourage the generator to comply with the decree by encouraging it to maximize I(c, G(z, c)), the mutual information between c and G(z, c), while making no demands on the mutual information between z and G(z, c).
Unfortunately, I(c, G(z, c)) is intractable in general. The key idea of InfoGAN is Variational Mutual Information Maximization: indirectly maximize it by maximizing a lower bound
{\hat {I}}(G,Q)=\mathbb {E} _{z\sim \mu_Z, c\sim \mu _{C}}[\ln Q(c\mid G(z,c))]; \quad I(c, G(z, c)) \geq \sup_Q \hat I(G, Q)
where Q ranges over all Markov kernels of type Q: \Omega_X \to \mathcal P(\Omega_C).
The InfoGAN game is defined as follows:
Three probability spaces define an InfoGAN game:
• (\Omega_X, \mu_{\text{ref}}), the space of reference images.
• (\Omega_Z, \mu_Z), the fixed random noise generator.
• (\Omega_C, \mu_C), the fixed random information generator.
There are 3 players in 2 teams: generator, Q, and discriminator. The generator and Q are on one team, and the discriminator on the other team.
The objective function is
L(G, Q, D) = L_{GAN}(G, D) - \lambda \hat I(G, Q)
where L_{GAN}(G, D) = \operatorname E_{x\sim \mu_{\text{ref}}}[\ln D(x)] + \operatorname E_{z\sim \mu_Z, c\sim\mu_C}[\ln (1-D(G(z, c)))] is the original GAN game objective, and \hat I(G, Q) = \mathbb E_{z\sim\mu_Z, c\sim\mu_C}[\ln Q(c \mid G(z, c))]
The generator-Q team aims to minimize the objective, and the discriminator aims to maximize it:
\min_{G, Q} \max_D L(G, Q, D)
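To make the lower bound \hat I(G, Q) concrete, here is a minimal sketch that computes it exactly on tiny finite spaces. Everything in it is an illustrative assumption rather than part of InfoGAN itself: the toy generator, the two candidate kernels, and the representation of distributions as Python dicts.

```python
import math
from itertools import product


def info_lower_bound(generator, q_kernel, noise_dist, code_dist):
    """Exact I_hat(G, Q) = E_{z, c}[ln Q(c | G(z, c))] on finite spaces.

    noise_dist, code_dist: dicts mapping each value to its probability.
    q_kernel: maps an image x to a dict of probabilities over codes c.
    """
    total = 0.0
    for (z, p_z), (c, p_c) in product(noise_dist.items(), code_dist.items()):
        x = generator(z, c)
        total += p_z * p_c * math.log(q_kernel(x)[c])
    return total


# Toy setup (hypothetical): an "image" is the pair (c, z), so the code c
# is fully recoverable from the generated sample.
def toy_generator(z, c):
    return (c, z)


def q_perfect(x):
    """Kernel that reads the code off the image."""
    return {x[0]: 1.0, 1 - x[0]: 0.0}


def q_uniform(x):
    """Kernel that ignores the image entirely."""
    return {0: 0.5, 1: 0.5}


uniform_bits = {0: 0.5, 1: 0.5}
print(info_lower_bound(toy_generator, q_perfect, uniform_bits, uniform_bits))
print(info_lower_bound(toy_generator, q_uniform, uniform_bits, uniform_bits))
```

The kernel that recovers c attains the larger value, which is why maximizing \hat I over both G and Q pushes the generator to keep c readable from its output.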
Bidirectional GAN (BiGAN)
The standard GAN generator is a function of type G: \Omega_Z\to \Omega_X, that is, it is a mapping from a latent space \Omega_Z to the image space \Omega_X. This can be understood as a "decoding" process, whereby every latent vector z\in \Omega_Z is a code for an image x\in \Omega_X, and the generator performs the decoding. This naturally leads to the idea of training another network that performs "encoding", creating an autoencoder out of the encoder-generator pair. Already in the original paper, the authors noted that learned approximate inference can be performed by training an auxiliary network to predict z given x.
The BiGAN is defined as follows:
Two probability spaces define a BiGAN game:
• (\Omega_X, \mu_{X}), the space of reference images.
• (\Omega_Z, \mu_Z), the latent space.
There are 3 players in 2 teams: generator, encoder, and discriminator. The generator and encoder are on one team, and the discriminator on the other team.
The generator's strategies are functions G:\Omega_Z \to \Omega_X, and the encoder's strategies are functions E:\Omega_X \to \Omega_Z. The discriminator's strategies are functions D:\Omega_X \times \Omega_Z \to [0, 1].
The objective function is
L(G, E, D) = \mathbb E_{x\sim \mu_X}[\ln D(x, E(x))] + \mathbb E_{z\sim \mu_Z}[\ln (1-D(G(z), z))]
The generator-encoder team aims to minimize the objective, and the discriminator aims to maximize it:
\min_{G, E} \max_D L(G, E, D)
In the paper, they gave a more abstract definition of the objective as:
L(G, E, D) = \mathbb E_{(x, z)\sim \mu_{E, X}}[\ln D(x, z)] + \mathbb E_{(x, z)\sim \mu_{G, Z}}[\ln (1-D(x, z))]
where \mu_{E, X}(dx, dz) = \mu_X(dx) \cdot \delta_{E(x)}(dz) is the probability distribution on \Omega_X\times \Omega_Z obtained by pushing \mu_X forward via x \mapsto (x, E(x)), and \mu_{G, Z}(dx, dz) = \delta_{G(z)}(dx)\cdot \mu_Z(dz) is the probability distribution on \Omega_X\times \Omega_Z obtained by pushing \mu_Z forward via z \mapsto (G(z), z).
Applications of bidirectional models include semi-supervised learning, interpretable machine learning, and neural machine translation.
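The BiGAN objective defined earlier in this section can be sketched on scalar toy data. The maps and the discriminator below are hypothetical stand-ins chosen so the numbers are easy to follow; the key point is that the discriminator sees pairs (x, E(x)) and (G(z), z), not images alone.

```python
import math


def bigan_objective(discriminator, generator, encoder, real_samples, latent_samples):
    """Batch estimate of E_x[ln D(x, E(x))] + E_z[ln(1 - D(G(z), z))]."""
    real_term = sum(math.log(discriminator(x, encoder(x)))
                    for x in real_samples) / len(real_samples)
    fake_term = sum(math.log(1.0 - discriminator(generator(z), z))
                    for z in latent_samples) / len(latent_samples)
    return real_term + fake_term


# Toy players (hypothetical): the generator doubles the latent value.
def toy_generator(z):
    return 2.0 * z


def inverting_encoder(x):
    """Encoder that exactly inverts the toy generator."""
    return x / 2.0


def constant_encoder(x):
    """Encoder that ignores its input."""
    return 0.0


def pair_discriminator(x, z):
    """Says 'probably real' (0.9) only when the pair is not on the curve x = 2z."""
    return 0.5 if abs(x - 2.0 * z) < 1e-9 else 0.9


real_samples = [1.0, 2.0]
latent_samples = [0.5, 1.0]
print(bigan_objective(pair_discriminator, toy_generator, inverting_encoder,
                      real_samples, latent_samples))
print(bigan_objective(pair_discriminator, toy_generator, constant_encoder,
                      real_samples, latent_samples))
```

With the constant encoder, the discriminator can tell the two kinds of pairs apart and the objective rises; with the inverting encoder, both kinds of pairs lie on the same curve and this discriminator gains nothing. This is the pressure that drives E toward inverting G.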
CycleGAN
CycleGAN is an architecture for performing translations between two domains, such as between photos of horses and photos of zebras, or photos of night cities and photos of day cities.
The CycleGAN game is defined as follows:
There are two probability spaces (\Omega_X, \mu_X), (\Omega_Y, \mu_Y), corresponding to the two domains needed for translations back and forth.
There are 4 players in 2 teams: generators G_X: \Omega_X \to \Omega_Y, G_Y: \Omega_Y \to \Omega_X, and discriminators D_X: \Omega_X\to [0, 1], D_Y:\Omega_Y\to [0, 1].
The objective function is
L(G_X, G_Y, D_X, D_Y) = L_{GAN}(G_X, D_Y) + L_{GAN}(G_Y, D_X) + \lambda L_{cycle}(G_X, G_Y)
where \lambda is a positive adjustable parameter, L_{GAN} is the GAN game objective (each generator is paired with the discriminator of the domain it generates into), and L_{cycle} is the cycle consistency loss:
L_{cycle}(G_X, G_Y) = E_{x\sim \mu_X} \|G_Y(G_X(x)) - x\| + E_{y\sim \mu_Y} \|G_X(G_Y(y)) - y\|
The generators aim to minimize the objective, and the discriminators aim to maximize it:
\min_{G_X, G_Y} \max_{D_X, D_Y} L(G_X, G_Y, D_X, D_Y)
Unlike previous work like pix2pix, which requires paired training data, CycleGAN requires no paired data. For example, to train a pix2pix model to turn a summer scenery photo into a winter scenery photo and back, the dataset must contain pairs of the same place in summer and winter, shot at the same angle; CycleGAN would only need a set of summer scenery photos, and an unrelated set of winter scenery photos.
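The cycle consistency loss can be sketched on small vectors. This sketch assumes that G_X maps \Omega_X to \Omega_Y and G_Y maps back, so the round trip starting from x is G_Y(G_X(x)); it also assumes the L1 norm, which the text leaves unspecified. The toy translators are hypothetical.

```python
def l1_distance(u, v):
    """L1 distance between two equal-length vectors (an assumed choice of norm)."""
    return sum(abs(a - b) for a, b in zip(u, v))


def cycle_consistency_loss(g_x, g_y, xs, ys):
    """Batch estimate of E_x ||G_Y(G_X(x)) - x|| + E_y ||G_X(G_Y(y)) - y||."""
    forward = sum(l1_distance(g_y(g_x(x)), x) for x in xs) / len(xs)
    backward = sum(l1_distance(g_x(g_y(y)), y) for y in ys) / len(ys)
    return forward + backward


# Toy translators (hypothetical): domain Y is domain X scaled by 2.
def to_y(x):
    return [2.0 * a for a in x]


def to_x_exact(y):
    return [a / 2.0 for a in y]


def to_x_biased(y):
    """Imperfect inverse: adds a constant offset of 0.1 to every component."""
    return [a / 2.0 + 0.1 for a in y]


xs = [[1.0, 2.0], [0.0, -1.0]]   # unpaired samples from domain X
ys = [[4.0, 6.0]]                # unrelated samples from domain Y
print(cycle_consistency_loss(to_y, to_x_exact, xs, ys))
print(cycle_consistency_loss(to_y, to_x_biased, xs, ys))
```

Note that xs and ys are never matched with each other: each sample is only compared with its own round trip, which is why no paired data is needed.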
GANs with particularly large or small scales
BigGAN
The BigGAN is essentially a self-attention GAN trained on a large scale (up to 80 million parameters) to generate large images of ImageNet (up to 512 x 512 resolution), with numerous engineering tricks to make it converge.
Invertible data augmentation
When there is insufficient training data, the reference distribution \mu_{\text{ref}} cannot be well-approximated by the empirical distribution given by the training dataset. In such cases, data augmentation can be applied, to allow training GAN on smaller datasets. Naïve data augmentation, however, brings its own problems.
Consider the original GAN game, slightly reformulated as follows:
\begin{cases} \min_D L_D(D, \mu_G) = -\operatorname E_{x\sim \mu_{\text{ref}}}[\ln D(x)] - \operatorname E_{x\sim \mu_G}[\ln (1-D(x))]\\ \min_G L_G(D, \mu_G) = \operatorname E_{x\sim \mu_G}[\ln (1-D(x))] \end{cases}
Now we use data augmentation by randomly sampling semantic-preserving transforms T: \Omega \to \Omega and applying them to the dataset, to obtain the reformulated GAN game:
\begin{cases} \min_D L_D(D, \mu_G) = -\operatorname E_{x\sim \mu_{\text{ref}}, T\sim \mu_\text{trans}}[\ln D(T(x))] - \operatorname E_{x\sim \mu_G}[\ln (1-D(x))]\\ \min_G L_G(D, \mu_G) = \operatorname E_{x\sim \mu_G}[\ln (1-D(x))] \end{cases}
This is equivalent to a GAN game with a different distribution \mu_{\text{ref}}', sampled by T(x), with x\sim \mu_{\text{ref}}, T\sim \mu_\text{trans}. For example, if \mu_{\text{ref}} is the distribution of images in ImageNet, and \mu_\text{trans} samples identity-transform with probability 0.5, and horizontal-reflection with probability 0.5, then \mu_{\text{ref}}' is the distribution of images in ImageNet and horizontally-reflected ImageNet, combined.
The result of such training would be a generator that mimics \mu_{\text{ref}}'. For example, it would generate images that look like they are randomly cropped, if the data augmentation uses random cropping.
The solution is to apply data augmentation to both generated and real images:
\begin{cases} \min_D L_D(D, \mu_G) = -\operatorname E_{x\sim \mu_{\text{ref}}, T\sim \mu_\text{trans}}[\ln D(T(x))] - \operatorname E_{x\sim \mu_G, T\sim \mu_\text{trans}}[\ln (1-D(T(x)))]\\ \min_G L_G(D, \mu_G) = \operatorname E_{x\sim \mu_G, T\sim \mu_\text{trans}}[\ln (1-D(T(x)))] \end{cases}
The authors demonstrated high-quality generation using datasets of only 100 images.
The StyleGAN-2-ADA paper makes a further point about data augmentation: it must be invertible.
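A minimal sketch of the discriminator loss with augmentation applied to both real and generated images, as in the corrected game above. The transform set (identity and horizontal flip), the nested-list image format, and the function names are illustrative assumptions.

```python
import math
import random


def identity(image):
    return image


def horizontal_flip(image):
    """Mirror each row of an image stored as a list of rows."""
    return [row[::-1] for row in image]


def sample_transform(rng):
    """Draw T from mu_trans: identity or horizontal flip, each with probability 0.5."""
    return rng.choice([identity, horizontal_flip])


def augmented_discriminator_loss(discriminator, real_images, generated_images, rng):
    """Batch estimate of L_D with a fresh transform T applied to every image,
    real and generated alike, before the discriminator sees it."""
    real_term = -sum(math.log(discriminator(sample_transform(rng)(x)))
                     for x in real_images) / len(real_images)
    fake_term = -sum(math.log(1.0 - discriminator(sample_transform(rng)(x)))
                     for x in generated_images) / len(generated_images)
    return real_term + fake_term


# Hypothetical usage with a discriminator that cannot tell anything apart.
rng = random.Random(0)
real_batch = [[[1, 2], [3, 4]]]
fake_batch = [[[0, 0], [1, 1]]]
print(augmented_discriminator_loss(lambda image: 0.5, real_batch, fake_batch, rng))
```

Because the discriminator never sees an unaugmented image from either source, the augmentation gives it no cue for separating real from generated samples, so the generator is not pushed toward reproducing the augmentations themselves.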
StyleGAN series
The StyleGAN family is a series of architectures published by Nvidia's research division.
Progressive GAN
Progressive GAN is a method for stably training GANs for large-scale image generation, by growing a GAN generator from small to large scale in a pyramidal fashion. Like SinGAN, it decomposes the generator as G = G_1 \circ G_2 \circ \cdots \circ G_N, and the discriminator as D = D_1 \circ D_2 \circ \cdots \circ D_N.
During training, at first only G_N, D_N are used in a GAN game to generate 4x4 images. Then G_{N-1}, D_{N-1} are added to reach the second stage of the GAN game, to generate 8x8 images, and so on, until we reach a GAN game to generate 1024x1024 images.
To avoid shock between stages of the GAN game, each new layer is "blended in" (Figure 2 of the paper).
StyleGAN-1
The key architectural choice of StyleGAN-1 is a progressive growth mechanism, similar to Progressive GAN. Each generated image starts as a constant 4\times 4 \times 512 array, and is repeatedly passed through style blocks. Each style block applies a "style latent vector" via affine transform ("adaptive instance normalization"), similar to how neural style transfer uses the Gramian matrix. It then adds noise, and normalizes (subtract the mean, then divide by the standard deviation).
At training time, usually only one style latent vector is used per image generated, but sometimes two ("mixing regularization") in order to encourage each style block to independently perform its stylization without expecting help from other style blocks (since they might receive an entirely different style latent vector).
After training, multiple style latent vectors can be fed into each style block. Those fed to the lower layers control the large-scale styles, and those fed to the higher layers control the fine-detail styles.
Style-mixing between two images x, x' can be performed as well. First, run a gradient descent to find z, z' such that G(z)\approx x, G(z')\approx x'. This is called "projecting an image back to style latent space". Then, z can be fed to the lower style blocks, and z' to the higher style blocks, to generate a composite image that has the large-scale style of x, and the fine-detail style of x'. Multiple images can also be composed this way.
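The normalize-then-restyle step inside a style block can be sketched as adaptive instance normalization on a few channels. This is a simplified illustration, not the StyleGAN implementation: each channel is a flat list of values, normalization divides by the standard deviation (the usual definition of instance normalization), and the per-channel scale and bias are assumed to have already been produced from the style latent vector by an affine map.

```python
import math


def adaptive_instance_norm(channels, scales, biases, eps=1e-8):
    """Normalize each channel to zero mean and unit standard deviation,
    then apply the style's per-channel scale and bias.

    channels: list of channels, each a flat list of activations.
    scales, biases: one number per channel (assumed outputs of the
    style's affine transform).
    """
    output = []
    for values, scale, bias in zip(channels, scales, biases):
        mean = sum(values) / len(values)
        variance = sum((v - mean) ** 2 for v in values) / len(values)
        std = math.sqrt(variance + eps)  # eps guards against division by zero
        output.append([scale * (v - mean) / std + bias for v in values])
    return output


# Hypothetical two-channel feature map and style parameters.
features = [[1.0, 3.0], [10.0, 10.0, 13.0]]
styled = adaptive_instance_norm(features, scales=[2.0, 1.0], biases=[5.0, 0.0])
print(styled)
```

After this step, the statistics of each channel are set by the style alone, which is what lets different style latent vectors be swapped in at different blocks.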
StyleGAN-2
StyleGAN-2 improves upon StyleGAN-1 by using the style latent vector to transform the convolution layer's weights instead, thus solving the "blob" problem.
This was updated by StyleGAN-2-ADA ("ADA" stands for "adaptive discriminator augmentation"), which uses invertible data augmentation as described above. It also tunes the amount of data augmentation applied by starting at zero, and gradually increasing it until an "overfitting heuristic" reaches a target level, thus the name "adaptive".
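The adaptive tuning described above can be sketched as a simple feedback rule on the augmentation probability p. The function name, step size, and target value here are hypothetical, and the overfitting heuristic is taken as a given number rather than computed from a discriminator.

```python
def update_augment_probability(p, overfit_heuristic, target, step=0.01):
    """Nudge the augmentation probability p toward the level at which the
    overfitting heuristic sits at its target.

    More overfitting than the target raises p; less lowers it.
    The result is clamped to the valid probability range [0, 1].
    """
    if overfit_heuristic > target:
        p += step
    else:
        p -= step
    return min(1.0, max(0.0, p))


# Hypothetical training trace: overfitting appears, then recedes.
p = 0.0  # training starts with no augmentation
for heuristic in [0.2, 0.7, 0.8, 0.9, 0.5]:
    p = update_augment_probability(p, heuristic, target=0.6, step=0.1)
    print(round(p, 2))
```

Starting from zero means a dataset that is large enough never receives augmentation it does not need, while a small dataset sees p climb as the discriminator begins to overfit.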
StyleGAN-3
StyleGAN-3 improves upon StyleGAN-2 by solving the "texture sticking" problem, which can be seen in the official videos. The authors analyzed the problem using the Nyquist–Shannon sampling theorem, and argued that the layers in the generator learned to exploit the high-frequency signal in the pixels they operate upon.
To solve this, they proposed imposing strict lowpass filters between the generator's layers, so that the generator is forced to operate on the pixels in a way faithful to the continuous signals they represent, rather than operate on them as merely discrete signals. They further imposed rotational and translational invariance by using more signal filters. The resulting StyleGAN-3 solves the texture sticking problem, and generates images that rotate and translate smoothly.
== Other uses ==