Training the generator in Wasserstein GAN is just
gradient descent, the same as in GAN (or most deep learning methods), but training the discriminator is different, because it is now restricted to have a bounded Lipschitz norm. There are several methods for enforcing this constraint.
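As an illustration of this division of labour, the following is a minimal PyTorch-style sketch of one training step, not the original authors' code; the toy two-dimensional data, network sizes, learning rates, and the RMSprop optimizer are illustrative assumptions, and the Lipschitz constraint on the discriminator is left as a placeholder to be filled in by one of the methods below.

<syntaxhighlight lang="python">
# Sketch of one WGAN training step: the generator is trained by ordinary
# gradient descent, while the discriminator ("critic") additionally needs
# its Lipschitz norm constrained by one of the methods described below.
import torch
from torch import nn

latent_dim, data_dim = 8, 2
G = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(), nn.Linear(64, data_dim))
D = nn.Sequential(nn.Linear(data_dim, 64), nn.Tanh(), nn.Linear(64, 1))
opt_G = torch.optim.RMSprop(G.parameters(), lr=5e-5)
opt_D = torch.optim.RMSprop(D.parameters(), lr=5e-5)

def training_step(real):
    # Discriminator step: maximize E[D(real)] - E[D(fake)], i.e. minimize its negative.
    z = torch.randn(real.shape[0], latent_dim)
    fake = G(z).detach()
    loss_D = -(D(real).mean() - D(fake).mean())
    opt_D.zero_grad()
    loss_D.backward()
    opt_D.step()
    # <-- enforce the Lipschitz constraint here (weight clipping, gradient penalty, ...)

    # Generator step: ordinary gradient descent on -E[D(G(z))].
    z = torch.randn(real.shape[0], latent_dim)
    loss_G = -D(G(z)).mean()
    opt_G.zero_grad()
    loss_G.backward()
    opt_G.step()

training_step(torch.randn(32, data_dim))  # example call on dummy data
</syntaxhighlight>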
=== Upper-bounding the Lipschitz norm ===

Let the discriminator function D be implemented by a multilayer perceptron:

D = D_n \circ D_{n-1} \circ \cdots \circ D_1

where D_i(x) = h(W_i x), and h:\mathbb{R} \to \mathbb{R} is a fixed activation function with \sup_x |h'(x)| \leq 1. For example, the hyperbolic tangent function h = \tanh satisfies this requirement. Then, for any x, writing x_i = (D_i \circ D_{i-1} \circ \cdots \circ D_1)(x), the chain rule gives

d D(x) = \operatorname{diag}(h'(W_n x_{n-1})) \cdot W_n \cdot \operatorname{diag}(h'(W_{n-1} x_{n-2})) \cdot W_{n-1} \cdots \operatorname{diag}(h'(W_1 x)) \cdot W_1 \cdot dx

Thus, the Lipschitz norm of D is upper-bounded by

\|D\|_L \leq \sup_{x}\|\operatorname{diag}(h'(W_n x_{n-1})) \cdot W_n \cdot \operatorname{diag}(h'(W_{n-1} x_{n-2})) \cdot W_{n-1} \cdots \operatorname{diag}(h'(W_1 x)) \cdot W_1\|_s

where \|\cdot\|_s is the operator norm of the matrix, that is, its largest singular value. (For a normal matrix this also equals the spectral radius, but for a general matrix the spectral radius can be strictly smaller.) Since \sup_x |h'(x)| \leq 1, we have \|\operatorname{diag}(h'(W_i x_{i-1}))\|_s = \max_j |h'((W_i x_{i-1})_j)| \leq 1, and, since the operator norm is submultiplicative, the upper bound

\|D\|_L \leq \prod_{i=1}^n \|W_i\|_s

Thus, if we can upper-bound the operator norm \|W_i\|_s of each weight matrix, we can upper-bound the Lipschitz norm of D.
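As an illustration, a short sketch of this bound in code, assuming PyTorch and a small fully connected tanh network (the architecture and the function name are chosen here for illustration): the product of the largest singular values of the weight matrices upper-bounds \|D\|_L.

<syntaxhighlight lang="python">
# Sketch: numerical upper bound on the Lipschitz norm of a tanh MLP
# as the product of the operator (spectral) norms ||W_i||_s of its weight matrices.
import torch
from torch import nn

D = nn.Sequential(
    nn.Linear(2, 64, bias=False), nn.Tanh(),
    nn.Linear(64, 64, bias=False), nn.Tanh(),
    nn.Linear(64, 1, bias=False),
)

def lipschitz_upper_bound(model: nn.Sequential) -> float:
    bound = 1.0
    for layer in model:
        if isinstance(layer, nn.Linear):
            # ord=2 gives the largest singular value, i.e. the operator norm ||W||_s.
            bound *= torch.linalg.matrix_norm(layer.weight, ord=2).item()
    return bound

print(lipschitz_upper_bound(D))
</syntaxhighlight>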
=== Weight clipping ===

For any m\times l matrix W, let c = \max_{i, j} |W_{i, j}|. Then

\|W\|_s^2 = \sup_{\|x\|_2=1}\|W x\|_2^2 = \sup_{\|x\|_2=1}\sum_{i}\left(\sum_j W_{i, j} x_j\right)^2 = \sup_{\|x\|_2=1}\sum_{i, j, k}W_{ij}W_{ik}x_jx_k \leq c^2 ml^2

since |x_j| \leq 1 for every j whenever \|x\|_2 = 1. Therefore, by clipping all entries of W to within some interval [-c, c], we can bound \|W\|_s, and hence the Lipschitz norm of D. This is the weight clipping method, proposed by the original paper.
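A minimal sketch of how such clipping might look in code, assuming a PyTorch discriminator; the threshold value 0.01 and the helper function name are illustrative choices, not prescribed by the argument above:

<syntaxhighlight lang="python">
# Sketch: clip every discriminator weight to [-c, c] after each update,
# so that max |W_ij| <= c and hence each ||W_i||_s is bounded.
import torch
from torch import nn

D = nn.Sequential(nn.Linear(2, 64), nn.Tanh(), nn.Linear(64, 1))
c = 0.01  # illustrative clipping threshold

def clip_discriminator_weights(model: nn.Module, c: float) -> None:
    with torch.no_grad():
        for p in model.parameters():
            p.clamp_(-c, c)

# ... called after each discriminator optimizer step in the training loop:
clip_discriminator_weights(D, c)
</syntaxhighlight>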
=== Gradient penalty ===

Instead of strictly bounding \|D\|_L, we can simply add a "gradient penalty" term to the discriminator loss, of the form

\mathbb{E}_{x\sim\hat\mu}[(\|\nabla D(x)\|_2 - a)^2]

where \hat \mu is a fixed distribution used to estimate how much the discriminator has violated the Lipschitz norm requirement. The discriminator, in attempting to minimize the new loss function, would naturally bring \nabla D(x) close to a everywhere, thus making \|D\|_L \approx a. This is the gradient penalty method.
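As an illustration, a minimal sketch of such a penalty term in PyTorch; taking \hat\mu to be the distribution of points interpolated between real and generated samples, and a = 1, is a common practical choice and an assumption here rather than something fixed by the formula above.

<syntaxhighlight lang="python">
# Sketch: gradient penalty  E_{x ~ mu_hat}[ (||grad D(x)||_2 - a)^2 ],
# with mu_hat taken (one common choice) to be random interpolations
# between real and generated samples, and a = 1.
import torch
from torch import nn

D = nn.Sequential(nn.Linear(2, 64), nn.Tanh(), nn.Linear(64, 1))

def gradient_penalty(D, real, fake, a=1.0):
    t = torch.rand(real.shape[0], 1)                      # interpolation coefficients
    x = (t * real + (1 - t) * fake).requires_grad_(True)  # samples from mu_hat
    grad, = torch.autograd.grad(D(x).sum(), x, create_graph=True)
    return ((grad.norm(2, dim=1) - a) ** 2).mean()

real, fake = torch.randn(32, 2), torch.randn(32, 2)
penalty = gradient_penalty(D, real, fake)
# The penalty is added (with a weighting coefficient) to the discriminator loss.
</syntaxhighlight>

== Further reading ==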