A CNN architecture is formed by a stack of distinct layers that transform the input volume into an output volume (e.g., holding the class scores) through a differentiable function. Several distinct types of layers are commonly used; these are discussed further below.
=== Convolutional layer ===
The convolutional layer is the core building block of a CNN. The layer's parameters consist of a set of learnable filters (or kernels), which have a small receptive field but extend through the full depth of the input volume. During the forward pass, each filter is convolved across the width and height of the input volume, computing the dot product between the filter entries and the input and producing a 2-dimensional activation map for that filter. As a result, the network learns filters that activate when they detect a specific type of feature at some spatial position in the input. Stacking the activation maps of all filters along the depth dimension forms the full output volume of the convolutional layer. Every entry in the output volume can thus also be interpreted as the output of a neuron that looks at a small region of the input. Each entry in an activation map uses the same set of parameters that define the filter.
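As a rough illustration of the forward pass described above, the following NumPy sketch (the shapes and function names are illustrative, not taken from any particular library) slides a single filter over an input volume and produces one 2-dimensional activation map; stacking the maps of several filters would form the output volume.

```python
import numpy as np

def conv2d_single_filter(x, w, b, stride=1):
    """Slide one filter w (kH, kW, C) over input x (H, W, C).

    Returns a 2-D activation map. This is the "convolution" used in
    CNNs (technically a cross-correlation), with no padding.
    """
    H, W, C = x.shape
    kH, kW, _ = w.shape
    out_h = (H - kH) // stride + 1
    out_w = (W - kW) // stride + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = x[i * stride:i * stride + kH,
                      j * stride:j * stride + kW, :]
            out[i, j] = np.sum(patch * w) + b   # dot product + bias
    return out

# Example: a 32x32 RGB input and a 5x5 filter spanning all 3 channels.
x = np.random.randn(32, 32, 3)
w = np.random.randn(5, 5, 3)
activation_map = conv2d_single_filter(x, w, b=0.0)
print(activation_map.shape)  # (28, 28)
```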
Self-supervised learning has been adapted for use in convolutional layers by using sparse patches with a high masking ratio and a global response normalization layer.
=== Local connectivity ===
When dealing with high-dimensional inputs such as images, it is impractical to connect neurons to all neurons in the previous volume because such a network architecture does not take the spatial structure of the data into account. Convolutional networks exploit spatially local correlation by enforcing a sparse local connectivity pattern between neurons of adjacent layers: each neuron is connected to only a small region of the input volume. The extent of this connectivity is a hyperparameter called the receptive field of the neuron. The connections are local in space (along width and height), but always extend along the entire depth of the input volume. Such an architecture ensures that the learned filters produce the strongest response to a spatially local input pattern.
=== Spatial arrangement ===
Three hyperparameters control the size of the output volume of the convolutional layer: the depth, stride, and padding size.
* The depth of the output volume controls the number of neurons in a layer that connect to the same region of the input volume. These neurons learn to activate for different features in the input. For example, if the first convolutional layer takes the raw image as input, then different neurons along the depth dimension may activate in the presence of various oriented edges, or blobs of color.
* Stride controls how depth columns around the width and height are allocated. If the stride is 1, then we move the filters one pixel at a time. This leads to heavily overlapping receptive fields between the columns, and to large output volumes. For any integer S > 0, a stride of S means that the filter is translated S units at a time per output. In practice, S \geq 3 is rare. A greater stride means smaller overlap of receptive fields and smaller spatial dimensions of the output volume.
* Sometimes, it is convenient to pad the input with zeros (or other values, such as the average of the region) on the border of the input volume. The size of this padding is a third hyperparameter. Padding provides control of the output volume's spatial size. In particular, it is sometimes desirable to exactly preserve the spatial size of the input volume; this is commonly referred to as "same" padding.

The spatial size of the output volume is a function of the input volume size W, the kernel field size K of the convolutional layer neurons, the stride S, and the amount of zero padding P on the border. The number of neurons that "fit" in a given volume is then \frac{W-K+2P}{S} + 1. If this number is not an integer, then the strides are incorrect and the neurons cannot be tiled to fit across the input volume in a symmetric way. In general, setting the zero padding to P = (K-1)/2 when the stride is S = 1 ensures that the input volume and output volume have the same spatial size. However, it is not always completely necessary to use all of the neurons of the previous layer. For example, a neural network designer may decide to use just a portion of padding.
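A small helper (names and sizes are illustrative) makes the output-size formula above concrete; for example, a 3×3 kernel with stride 1 and "same" padding P = (3-1)/2 = 1 preserves the spatial size.

```python
def conv_output_size(W, K, S, P):
    """Output positions along one spatial dimension: (W - K + 2P)/S + 1."""
    numerator = W - K + 2 * P
    if numerator % S != 0:
        raise ValueError("stride does not tile the input symmetrically")
    return numerator // S + 1

print(conv_output_size(W=7, K=3, S=1, P=1))   # 7  ("same" padding preserves size)
print(conv_output_size(W=32, K=5, S=1, P=0))  # 28 (no padding shrinks the map)
print(conv_output_size(W=32, K=2, S=2, P=0))  # 16 (stride 2 halves the spatial size)
```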
=== Parameter sharing ===
A parameter sharing scheme is used in convolutional layers to control the number of free parameters. It relies on the assumption that if a patch feature is useful to compute at some spatial position, then it should also be useful to compute at other positions. Denoting a single 2-dimensional slice of depth as a depth slice, the neurons in each depth slice are constrained to use the same weights and bias. Since all neurons in a single depth slice share the same parameters, the forward pass in each depth slice of the convolutional layer can be computed as a convolution of the neuron's weights with the input volume. Therefore, it is common to refer to the sets of weights as a filter (or a kernel), which is convolved with the input. The result of this convolution is an activation map, and the set of activation maps for each different filter is stacked together along the depth dimension to produce the output volume. Parameter sharing contributes to the translation invariance of the CNN architecture.
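A quick back-of-the-envelope calculation (the layer sizes here are illustrative, not taken from the text) shows the effect of parameter sharing on the number of free parameters.

```python
# Hypothetical first convolutional layer: 32x32x3 input, 10 filters of size 5x5x3.
in_h, in_w, in_c = 32, 32, 3
k, num_filters = 5, 10
out_h = out_w = (in_h - k) + 1          # 28 output positions per side (stride 1, no padding)

with_sharing = num_filters * (k * k * in_c + 1)                      # one weight set + bias per filter
without_sharing = num_filters * out_h * out_w * (k * k * in_c + 1)   # separate weights per position

print(with_sharing)      # 760
print(without_sharing)   # 595840
```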
=== Pooling layer ===
Intuitively, the exact location of a feature is less important than its rough location relative to other features. This is the idea behind the use of pooling in convolutional neural networks. The pooling layer serves to progressively reduce the spatial size of the representation, to reduce the number of parameters, memory footprint and amount of computation in the network, and hence to also control overfitting. This is known as down-sampling. It is common to periodically insert a pooling layer between successive convolutional layers (each one typically followed by an activation function, such as a ReLU layer) in a CNN architecture. The pooling layer commonly operates independently on every depth slice of the input and resizes it spatially. Note that without a stride greater than 1, pooling would not perform downsampling: the pooling window would simply move across the input one step at a time without reducing the size of the feature map. It is the stride that actually causes the downsampling, by determining how far the pooling window moves over the input.

A very common form of max pooling is a layer with filters of size 2×2, applied with a stride of 2, which subsamples every depth slice in the input by 2 along both width and height, discarding 75% of the activations:

f_{X,Y}(S) = \max_{a,b=0}^{1} S_{2X+a,\,2Y+b}.

In this case, every max operation is over 4 numbers. The depth dimension remains unchanged (this is true for other forms of pooling as well). In addition to max pooling, pooling units can use other functions, such as average pooling or ℓ2-norm pooling. Average pooling was often used historically but has recently fallen out of favor compared to max pooling, which generally performs better in practice. Because of the fast spatial reduction of the size of the representation, there is a recent trend towards using smaller filters or discarding pooling layers altogether.
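The 2×2, stride-2 max pooling formula above can be sketched directly in NumPy (shapes and names are illustrative); each output value is the maximum over one 2×2 window of a depth slice.

```python
import numpy as np

def max_pool_2x2(x):
    """2x2 max pooling with stride 2 on an input of shape (H, W, C).

    H and W are assumed even; the depth dimension C is unchanged.
    """
    H, W, C = x.shape
    x = x.reshape(H // 2, 2, W // 2, 2, C)      # group each 2x2 window
    return x.max(axis=(1, 3))                   # f_{X,Y} = max over a, b in {0, 1}

x = np.random.randn(8, 8, 16)
print(max_pool_2x2(x).shape)  # (4, 4, 16): width and height halved, depth unchanged
```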
=== Channel max pooling ===
A channel max pooling (CMP) layer applies the max pooling (MP) operation along the channel dimension, at corresponding positions of consecutive feature maps, in order to eliminate redundant information. CMP gathers the significant features into fewer channels, which is important for fine-grained image classification, where more discriminating features are needed. Another advantage of the CMP operation is that it reduces the channel number of the feature maps before they are connected to the first fully connected (FC) layer. Similar to the MP operation, we denote the input and output feature maps of a CMP layer as F \in \mathbb{R}^{C\times M\times N} and \tilde{F} \in \mathbb{R}^{c\times M\times N}, respectively, where C and c are the channel numbers of the input and output feature maps, and M and N are the width and the height of the feature maps, respectively. Note that the CMP operation only changes the channel number of the feature maps; the width and the height of the feature maps are not changed, which is different from the MP operation. See the literature for reviews of pooling methods.
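A minimal sketch of such a channel-wise max pooling (the grouping size and names are illustrative assumptions) takes the maximum over non-overlapping groups of consecutive channels at each spatial position, leaving the width and height unchanged.

```python
import numpy as np

def channel_max_pool(f, group):
    """Max over non-overlapping groups of `group` consecutive channels.

    f has shape (C, M, N); the result has shape (C // group, M, N),
    so only the channel number changes, not the spatial size.
    """
    C, M, N = f.shape
    assert C % group == 0
    return f.reshape(C // group, group, M, N).max(axis=1)

f = np.random.randn(64, 14, 14)
print(channel_max_pool(f, group=4).shape)  # (16, 14, 14)
```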
=== ReLU layer ===
ReLU is the abbreviation of rectified linear unit. It applies the non-saturating activation function f(x) = \max(0, x), which effectively removes negative values from an activation map by setting them to zero. It was proposed by Alston Householder in 1941, and used in CNNs by Kunihiko Fukushima in 1969. It introduces nonlinearity to the decision function and to the overall network without affecting the receptive fields of the convolutional layers. In 2011, Xavier Glorot, Antoine Bordes and Yoshua Bengio found that ReLU enables better training of deeper networks compared to the activation functions widely used before 2011. Other functions can also be used to increase nonlinearity, for example the saturating hyperbolic tangent f(x) = \tanh(x) or f(x) = |\tanh(x)|, and the sigmoid function \sigma(x) = (1 + e^{-x})^{-1}. ReLU is often preferred to other functions because it trains the neural network several times faster without a significant penalty to generalization accuracy.
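For concreteness, the activation functions mentioned above can be written in a few lines of NumPy (a sketch, not tied to any particular framework).

```python
import numpy as np

def relu(x):
    """Rectified linear unit: zeroes out negative activations."""
    return np.maximum(0.0, x)

def sigmoid(x):
    """Saturating logistic function sigma(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + np.exp(-x))

a = np.array([-2.0, -0.5, 0.0, 1.5])
print(relu(a))          # [0.  0.  0.  1.5]
print(np.tanh(a))       # saturating hyperbolic tangent
print(sigmoid(a))       # values in (0, 1)
```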
=== Fully connected layer ===
After several convolutional and max pooling layers, the final classification is done via fully connected layers. Neurons in a fully connected layer have connections to all activations in the previous layer, as in regular (non-convolutional) artificial neural networks. Their activations can thus be computed as an affine transformation: a matrix multiplication followed by a bias offset (vector addition of a learned or fixed bias term).
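In code, the affine transformation of a fully connected layer looks roughly as follows (the shapes are illustrative): the pooled feature maps are flattened into a vector, multiplied by a weight matrix, and shifted by a bias vector.

```python
import numpy as np

def fully_connected(x, W, b):
    """Affine transformation y = W x + b of a flattened input x."""
    return W @ x + b

features = np.random.randn(4, 4, 16)   # e.g. output of the last pooling layer
x = features.reshape(-1)               # flatten to a 256-dimensional vector
W = np.random.randn(10, x.size)        # 10 output neurons (e.g. class scores)
b = np.zeros(10)
scores = fully_connected(x, W, b)
print(scores.shape)  # (10,)
```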
Loss layer The "loss layer", or "
loss function", exemplifies how
training penalizes the deviation between the predicted output of the network, and the
true data labels (during supervised learning). Various
loss functions can be used, depending on the specific task. The
Softmax loss function is used for predicting a single class of
K mutually exclusive classes.
Sigmoid cross-entropy loss is used for predicting
K independent probability values in [0,1].
Euclidean loss is used for
regressing to
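The three losses mentioned above can be sketched compactly in NumPy (a hand-rolled illustration, not a specific framework's API).

```python
import numpy as np

def softmax_loss(scores, label):
    """Cross-entropy over softmax probabilities for one of K mutually exclusive classes."""
    shifted = scores - scores.max()                      # for numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum())
    return -log_probs[label]

def sigmoid_cross_entropy_loss(logits, targets):
    """Sum of binary cross-entropies for K independent probabilities in [0, 1]."""
    p = 1.0 / (1.0 + np.exp(-logits))
    return -np.sum(targets * np.log(p) + (1 - targets) * np.log(1 - p))

def euclidean_loss(pred, target):
    """Squared L2 distance for regression to real-valued labels."""
    return 0.5 * np.sum((pred - target) ** 2)

scores = np.array([2.0, 0.5, -1.0])
print(softmax_loss(scores, label=0))
print(sigmoid_cross_entropy_loss(np.array([0.3, -1.2]), np.array([1.0, 0.0])))
print(euclidean_loss(np.array([1.0, 2.0]), np.array([0.5, 2.5])))
```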
== Hyperparameters ==