This section closely follows the original paper.
=== Overview ===
The idea of Neural Style Transfer (NST) is to take two images, a content image \vec{p} and a style image \vec{a}, and generate a third image \vec{x} that minimizes a weighted combination of two loss functions: a content loss \mathcal{L}_{\text{content}}(\vec{p}, \vec{x}) and a style loss \mathcal{L}_{\text{style}}(\vec{a}, \vec{x}). The total loss is a linear combination of the two:

\mathcal{L}_{\text{NST}}(\vec{p}, \vec{a}, \vec{x}) = \alpha \mathcal{L}_{\text{content}}(\vec{p}, \vec{x}) + \beta \mathcal{L}_{\text{style}}(\vec{a}, \vec{x})

By jointly minimizing the content and style losses, NST generates an image that blends the content of the content image with the style of the style image. Both the content loss and the style loss measure the similarity of two images. The content similarity is a weighted sum of squared differences between the neural activations of a single convolutional neural network (CNN) on the two images. The style similarity is a weighted sum of squared differences between the Gram matrices of the two images within each layer (see below for details). The original paper used a VGG-19 CNN, but the method works for any CNN.
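As a concrete illustration, the following is a minimal Python sketch of the total objective (PyTorch is assumed here and in the sketches below); the name nst_loss and its arguments are illustrative stand-ins for the quantities defined in this section.

```python
def nst_loss(content_loss, style_loss, alpha=1.0, beta=1000.0):
    # Weighted linear combination of the two objectives; the ratio
    # alpha / beta controls the content/style trade-off.
    return alpha * content_loss + beta * style_loss
```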
=== Symbols ===
Let \vec{x} be an image input to a CNN, and let F^l(\vec{x}) \in \mathbb{R}^{N_l \times M_l} be the matrix of filter responses in layer l to the image \vec{x}, where:
• N_l is the number of filters in layer l;
• M_l is the height times the width (i.e. the number of pixels) of each filter's feature map in layer l;
• F_{ij}^l(\vec{x}) is the activation of the i^{\text{th}} filter at position j in layer l.
A given input image \vec{x} is encoded in each layer of the CNN by the filter responses to that image, with higher layers encoding more global features but losing detail on local features.
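The feature matrix F^l is simply a layer's activation tensor flattened along its spatial dimensions. A minimal sketch, assuming one layer's activations for one image come as a tensor of shape (N_l, H, W):

```python
import torch

def feature_matrix(act: torch.Tensor) -> torch.Tensor:
    # act: activations of one layer for one image, shape (N_l, H, W).
    n_l, h, w = act.shape
    # F^l has shape (N_l, M_l) with M_l = H * W: row i holds the
    # responses of filter i at every spatial position j.
    return act.reshape(n_l, h * w)
```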
=== Content loss ===
Let \vec{p} be an original image, and let \vec{x} be an image generated to match the content of \vec{p}. Let P^l = F^l(\vec{p}) be the matrix of filter responses in layer l to the image \vec{p}. The content loss is defined as the squared-error loss between the feature representations of the generated image and the content image at a chosen layer l of the CNN:

\mathcal{L}_{\text{content}}(\vec{p}, \vec{x}, l) = \frac{1}{2} \sum_{i, j} \left( F_{ij}^l(\vec{x}) - F_{ij}^l(\vec{p}) \right)^2

where F_{ij}^l(\vec{x}) and F_{ij}^l(\vec{p}) are the activations of the i^{\text{th}} filter at position j in layer l for the generated and content images, respectively. Minimizing this loss encourages the generated image to have content similar to that of the content image, as captured by the feature activations in the chosen layer. The total content loss is a linear combination of the per-layer content losses:

\mathcal{L}_{\text{content}}(\vec{p}, \vec{x}) = \sum_l v_l \, \mathcal{L}_{\text{content}}(\vec{p}, \vec{x}, l),

where the v_l are positive real numbers chosen as hyperparameters.
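In code, the per-layer content loss is a sum of squared differences between two feature matrices; a minimal sketch, reusing feature_matrix-shaped inputs from above:

```python
def content_loss(f_x: torch.Tensor, f_p: torch.Tensor) -> torch.Tensor:
    # f_x = F^l(x), f_p = P^l, both of shape (N_l, M_l).
    # Squared-error loss with the factor of 1/2 from the definition.
    return 0.5 * ((f_x - f_p) ** 2).sum()
```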
=== Style loss ===
The style loss is based on the Gram matrices of the generated and style images, which capture the correlations between different filter responses at each layer of the CNN:

\mathcal{L}_{\text{style}}(\vec{a}, \vec{x}) = \sum_{l=0}^{L} w_l E_l, \qquad E_l = \frac{1}{4 N_l^2 M_l^2} \sum_{i, j} \left( G_{ij}^l(\vec{x}) - G_{ij}^l(\vec{a}) \right)^2.

Here, G_{ij}^l(\vec{x}) and G_{ij}^l(\vec{a}) are the entries of the Gram matrices of the generated and style images at layer l. Explicitly,

G_{ij}^l(\vec{x}) = \sum_k F_{ik}^l(\vec{x}) F_{jk}^l(\vec{x}).

Minimizing this loss encourages the generated image to have style characteristics similar to those of the style image, as captured by the correlations between feature responses in each layer. The idea is that the correlations between the activation patterns of filters within a single layer capture the "style" at the scale of the receptive fields of that layer. As with the content loss, the w_l are positive real numbers chosen as hyperparameters.
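A minimal sketch of the Gram matrix and the per-layer style term E_l, again in terms of (N_l, M_l) feature matrices:

```python
def gram(f: torch.Tensor) -> torch.Tensor:
    # G^l_{ij} = sum_k F^l_{ik} F^l_{jk}: correlations between the
    # responses of filters i and j across all spatial positions.
    return f @ f.t()

def style_layer_loss(f_x: torch.Tensor, f_a: torch.Tensor) -> torch.Tensor:
    # E_l with the normalization 1 / (4 N_l^2 M_l^2) from the definition.
    n_l, m_l = f_x.shape
    return ((gram(f_x) - gram(f_a)) ** 2).sum() / (4 * n_l**2 * m_l**2)
```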
=== Hyperparameters ===
The original paper used a particular choice of hyperparameters. For the style loss, w_l = 0.2 for the outputs of layers conv1_1, conv2_1, conv3_1, conv4_1, and conv5_1 of the VGG-19 network, and w_l = 0 otherwise. For the content loss, v_l = 1 for conv4_2 and v_l = 0 otherwise. The ratio \alpha / \beta \in [5, 50] \times 10^{-4}.
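These choices are easy to express as configuration; a sketch using the layer names above (the variable names here are illustrative, not from the paper):

```python
# w_l = 0.2 on five layers, zero elsewhere.
STYLE_WEIGHTS = {
    "conv1_1": 0.2, "conv2_1": 0.2, "conv3_1": 0.2,
    "conv4_1": 0.2, "conv5_1": 0.2,
}
# v_l = 1 on a single layer, zero elsewhere.
CONTENT_WEIGHTS = {"conv4_2": 1.0}
# One value inside the paper's reported range [5, 50] x 10^-4.
ALPHA_OVER_BETA = 1e-3
```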
=== Training ===
The image \vec{x} is initialized by adding a small amount of white noise to the content image \vec{p} and feeding it through the CNN. The loss is then repeatedly backpropagated through the network with the CNN weights held fixed in order to update the pixels of \vec{x}. After several thousand iterations, an \vec{x} (hopefully) emerges that matches the style of \vec{a} and the content of \vec{p}; when implemented on a GPU, this takes a few minutes to converge.
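Putting the pieces together, here is a minimal sketch of the optimization loop. The function features is hypothetical: it is assumed to run the frozen CNN on an image and return a dict mapping layer names to F^l feature matrices, differentiably in its input. Adam is used here as a simple, widely available optimizer; it is not necessarily the optimizer used in the original work.

```python
def run_nst(p: torch.Tensor, a: torch.Tensor,
            alpha: float = 1e-3, beta: float = 1.0,
            steps: int = 2000, lr: float = 1e-2) -> torch.Tensor:
    # Targets are computed once, with gradients disabled.
    with torch.no_grad():
        f_p = features(p)  # hypothetical: {layer name: P^l} for the content image
        f_a = features(a)  # hypothetical: {layer name: F^l(a)} for the style image
    # Initialize x as the content image plus a little white noise;
    # the pixels of x are the only parameters being optimized.
    x = (p + 0.1 * torch.randn_like(p)).requires_grad_(True)
    opt = torch.optim.Adam([x], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        f_x = features(x)
        c = sum(v * content_loss(f_x[n], f_p[n])
                for n, v in CONTENT_WEIGHTS.items())
        s = sum(w * style_layer_loss(f_x[n], f_a[n])
                for n, w in STYLE_WEIGHTS.items())
        loss = nst_loss(c, s, alpha, beta)
        loss.backward()  # gradients flow only into the pixels of x
        opt.step()
    return x.detach()
```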
== Extensions ==