Pooling is most commonly used in
convolutional neural networks (CNNs). Below is a description of pooling in 2-dimensional CNNs; the generalization to n dimensions is immediate. As notation, we consider a tensor x \in \R^{H \times W \times C}, where H is height, W is width, and C is the number of channels. A pooling layer outputs a tensor y \in \R^{H' \times W' \times C'}. We define two variables f and s, called the "filter size" (aka "kernel size") and the "stride". Sometimes it is necessary to use a different filter size and stride for the horizontal and vertical directions. In such cases, we define four variables: f_H, f_W, s_H, s_W. The
receptive field of an entry in the output tensor y is the set of all entries in x that can affect that entry.
== Max pooling ==
Max pooling (MaxPool) is commonly used in CNNs to reduce the spatial dimensions of feature maps. Define

\mathrm{MaxPool}(x | f, s)_{0,0,0} = \max(x_{0:f-1, 0:f-1, 0})

where 0:f-1 means the range 0, 1, \dots, f-1. Note that we need to avoid the off-by-one error. The next input is

\mathrm{MaxPool}(x | f, s)_{1,0,0} = \max(x_{s:s+f-1, 0:f-1, 0})

and so on. The receptive field of y_{i,j,c} is x_{is:is+f-1, js:js+f-1, c}, so in general,

\mathrm{MaxPool}(x | f, s)_{i,j,c} = \max(x_{is:is+f-1, js:js+f-1, c})

If the horizontal and vertical filter sizes and strides differ, then

\mathrm{MaxPool}(x | f_H, f_W, s_H, s_W)_{i,j,c} = \max(x_{is_H:is_H+f_H-1, js_W:js_W+f_W-1, c})

More succinctly, we can write y_k = \max(\{x_{k'} \mid k' \text{ in the receptive field of } k\}).

If H is not expressible as ks + f with k an integer, then computing the entries of the output tensor on the boundary would require max pooling to take as inputs entries outside the tensor. How those non-existent entries are handled depends on the padding conditions.
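A minimal NumPy sketch may make the indexing concrete. The helper pool2d and the names below are illustrative, not from any library; the sketch assumes no padding, so windows that would run off the tensor are simply not produced.

<syntaxhighlight lang="python">
import numpy as np

def pool2d(x, f, s, reduce):
    """Apply `reduce` to every f x f window of an H x W x C tensor with stride s.

    Illustrative helper, not a library function; assumes no padding."""
    H, W, C = x.shape
    H_out = (H - f) // s + 1
    W_out = (W - f) // s + 1
    y = np.empty((H_out, W_out, C), dtype=float)
    for i in range(H_out):
        for j in range(W_out):
            # Receptive field of y[i, j, :] is x[i*s : i*s+f, j*s : j*s+f, :].
            y[i, j] = reduce(x[i*s:i*s+f, j*s:j*s+f, :])
    return y

def max_pool(x, f, s):
    return pool2d(x, f, s, lambda w: w.max(axis=(0, 1)))

x = np.arange(16, dtype=float).reshape(4, 4, 1)
print(max_pool(x, f=2, s=2)[:, :, 0])  # [[ 5.  7.] [13. 15.]]
</syntaxhighlight>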
Global Max Pooling (GMP) is a specific kind of max pooling where the output tensor is y \in \R^{C} and the receptive field of y_c is all of x_{0:H-1, 0:W-1, c}. That is, it takes the maximum over each entire channel. It is often used just before the final fully connected layers in a CNN classification head.
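In NumPy, for example, global max pooling is a single reduction over the spatial axes:

<syntaxhighlight lang="python">
import numpy as np

x = np.random.rand(32, 32, 64)  # an H x W x C feature map
y = x.max(axis=(0, 1))          # global max pooling: one maximum per channel, shape (64,)
</syntaxhighlight>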
== Average pooling ==
Average pooling (AvgPool) is similarly defined:

\mathrm{AvgPool}(x | f, s)_{i,j,c} = \mathrm{average}(x_{is:is+f-1, js:js+f-1, c}) = \frac{1}{f^2} \sum_{k \in is:is+f-1} \sum_{l \in js:js+f-1} x_{k,l,c}
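Using the illustrative pool2d helper from the max pooling sketch above, only the reduction changes:

<syntaxhighlight lang="python">
# Reuses the illustrative pool2d helper defined in the max pooling sketch.
def avg_pool(x, f, s):
    # Mean of the f x f receptive field, i.e. (1/f^2) times its sum.
    return pool2d(x, f, s, lambda w: w.mean(axis=(0, 1)))
</syntaxhighlight>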
Global Average Pooling (GAP) is defined analogously to GMP: it takes the average over each entire channel. It was first proposed in Network-in-Network. Like GMP, it is often used just before the final fully connected layers in a CNN classification head.
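In the same NumPy notation as the GMP snippet above:

<syntaxhighlight lang="python">
y = x.mean(axis=(0, 1))  # global average pooling: per-channel mean, shape (C,)
</syntaxhighlight>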
== Interpolations ==
There are some poolings that interpolate between max pooling and average pooling.
Mixed pooling is a linear combination of max pooling and average pooling. That is,

\mathrm{MixedPool}(x | f, s, w) = w\,\mathrm{MaxPool}(x | f, s) + (1-w)\,\mathrm{AvgPool}(x | f, s)

where w \in [0, 1] is either a hyperparameter, a learnable parameter, or randomly sampled anew every time.
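In terms of the sketches above, with w taken here as a fixed hyperparameter:

<syntaxhighlight lang="python">
# Reuses the max_pool and avg_pool sketches defined above.
def mixed_pool(x, f, s, w):
    # Convex combination of the two poolings, elementwise on the outputs.
    return w * max_pool(x, f, s) + (1 - w) * avg_pool(x, f, s)
</syntaxhighlight>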
Lp pooling is similar to average pooling, but uses the Lp-norm average instead of the plain average:

y_k = \left(\frac{1}{N} \sum_{k' \text{ in the receptive field of } k} |x_{k'}|^p\right)^{1/p}

where N is the size of the receptive field, and p \geq 1 is a hyperparameter. If all activations are non-negative, then average pooling is the case p = 1, and max pooling is the limit p \to \infty.
Square-root pooling is the case of p = 2.
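A sketch of the general case, again reusing the illustrative pool2d helper:

<syntaxhighlight lang="python">
import numpy as np

# Reuses the illustrative pool2d helper defined in the max pooling sketch.
def lp_pool(x, f, s, p):
    # Lp-norm average over each receptive field of N = f*f entries:
    # ((1/N) * sum |x|^p)^(1/p).
    return pool2d(x, f, s, lambda w: (np.abs(w) ** p).mean(axis=(0, 1)) ** (1.0 / p))

# lp_pool(x, f, s, p=2) is square-root pooling; p = 1 on non-negative
# activations recovers average pooling, and large p approaches max pooling.
</syntaxhighlight>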
Stochastic pooling samples a random activation x_{k'} from the receptive field with probability \frac{x_{k'}}{\sum_{k''} x_{k''}}, where the sum runs over the receptive field. In expectation, its output is the activation-weighted average \frac{\sum_{k'} x_{k'}^2}{\sum_{k'} x_{k'}} of the receptive field.
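A sketch, assuming strictly positive activations so that each window normalizes to a probability distribution:

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)

# Reuses the illustrative pool2d helper defined in the max pooling sketch.
def sample_window(w):
    # w has shape (f, f, C); sample one activation per channel.
    f1, f2, C = w.shape
    flat = w.reshape(f1 * f2, C)
    out = np.empty(C)
    for c in range(C):
        p = flat[:, c] / flat[:, c].sum()  # probability proportional to activation
        out[c] = rng.choice(flat[:, c], p=p)
    return out

def stochastic_pool(x, f, s):
    return pool2d(x, f, s, sample_window)
</syntaxhighlight>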
Softmax pooling is like max pooling, but uses the softmax-weighted average

\frac{\sum_{k'} e^{\beta x_{k'}} x_{k'}}{\sum_{k'} e^{\beta x_{k'}}}

where \beta > 0 and the sums run over the receptive field. Average pooling is the limit \beta \downarrow 0, and max pooling is the limit \beta \uparrow \infty.
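A sketch, using the standard max-subtraction trick for numerical stability (the common factor cancels between numerator and denominator):

<syntaxhighlight lang="python">
import numpy as np

# Reuses the illustrative pool2d helper defined in the max pooling sketch.
def softmax_pool(x, f, s, beta):
    def reduce(w):
        # Subtract the per-channel window maximum before exponentiating;
        # this cancels in the ratio but avoids overflow.
        e = np.exp(beta * (w - w.max(axis=(0, 1), keepdims=True)))
        return (e * w).sum(axis=(0, 1)) / e.sum(axis=(0, 1))
    return pool2d(x, f, s, reduce)
</syntaxhighlight>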
== Other poolings ==
Spatial pyramidal pooling applies max pooling (or any other form of pooling) in a pyramid structure. That is, it applies global max pooling, then applies max pooling to the image divided into 4 equal parts, then 16, etc. The results are then concatenated. It is a hierarchical form of global pooling, and like global pooling, it is often used just before a classification head.
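A self-contained sketch; the function name and the choice of pyramid levels are illustrative. The output has fixed length (C \cdot (1 + 4 + 16) for levels 1, 2, 4) regardless of the input's spatial size:

<syntaxhighlight lang="python">
import numpy as np

def spatial_pyramid_pool(x, levels=(1, 2, 4)):
    """Illustrative sketch: max pooling over 1x1, 2x2, 4x4, ... grids of
    roughly equal cells, concatenated into one fixed-length vector."""
    H, W, C = x.shape
    feats = []
    for n in levels:
        # Split the spatial dims into n x n cells; cells may differ by one
        # pixel when H or W is not divisible by n (assumes H, W >= n).
        h_edges = np.linspace(0, H, n + 1).astype(int)
        w_edges = np.linspace(0, W, n + 1).astype(int)
        for a in range(n):
            for b in range(n):
                cell = x[h_edges[a]:h_edges[a+1], w_edges[b]:w_edges[b+1], :]
                feats.append(cell.max(axis=(0, 1)))
    return np.concatenate(feats)
</syntaxhighlight>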
Region of Interest Pooling (also known as RoI pooling) is a variant of max pooling used in R-CNNs for object detection. It is designed to take an arbitrarily-sized input matrix and output a fixed-size output matrix.
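A sketch of the core idea, assuming the region has already been cropped out of the feature map; the binning scheme here is one simple choice, not the exact quantization used in any particular R-CNN:

<syntaxhighlight lang="python">
import numpy as np

def roi_max_pool(region, out_h, out_w):
    """Illustrative sketch: max-pool an arbitrary H x W x C region into a
    fixed out_h x out_w x C output by binning the spatial dimensions.
    Assumes H >= out_h and W >= out_w so every bin is non-empty."""
    H, W, C = region.shape
    h_edges = np.linspace(0, H, out_h + 1).astype(int)
    w_edges = np.linspace(0, W, out_w + 1).astype(int)
    y = np.empty((out_h, out_w, C), dtype=region.dtype)
    for i in range(out_h):
        for j in range(out_w):
            y[i, j] = region[h_edges[i]:h_edges[i+1],
                             w_edges[j]:w_edges[j+1], :].max(axis=(0, 1))
    return y
</syntaxhighlight>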
Covariance pooling computes the covariance matrix of the vectors \{x_{k,l,0:C-1}\}_{k \in is:is+f-1, l \in js:js+f-1}, which is then flattened to a C^2-dimensional vector y_{i,j,0:C^2-1}. Global covariance pooling is used similarly to global max pooling. As average pooling computes the average, which is a first-degree statistic, and covariance is a second-degree statistic, covariance pooling is also called "second-order pooling". It can be generalized to higher-order poolings.
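A sketch of the global variant:

<syntaxhighlight lang="python">
import numpy as np

def global_covariance_pool(x):
    """Illustrative sketch: covariance of the C-dimensional feature vectors
    over all spatial positions, flattened to a length-C^2 vector."""
    H, W, C = x.shape
    v = x.reshape(H * W, C)        # one C-vector per spatial position
    cov = np.cov(v, rowvar=False)  # C x C covariance matrix (needs C >= 2)
    return cov.reshape(C * C)
</syntaxhighlight>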
Blur pooling means applying a blurring method before downsampling. For example, Rect-2 blur pooling means taking an average pooling at f = 2, s = 1, then taking every second pixel (identity with s = 2).
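A sketch of Rect-2 blur pooling in terms of the avg_pool sketch above:

<syntaxhighlight lang="python">
# Reuses the avg_pool sketch defined above.
def rect2_blur_pool(x):
    blurred = avg_pool(x, f=2, s=1)  # the Rect-2 blur: 2x2 average, stride 1
    return blurred[::2, ::2, :]      # then keep every second pixel (stride 2)
</syntaxhighlight>

== Vision Transformer pooling ==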