Let \mathbb F be a
field such as the
real numbers \mathbb R or the
complex numbers \mathbb C. A tensor {\mathcal T} \in {\mathbb F}^{I_0 \times I_1 \times \ldots \times I_C} is a multilinear transformation from a set of domain vector spaces to a range vector space:
: {\mathcal T}: {\mathbb F}^{I_1} \times {\mathbb F}^{I_2} \times \ldots \times {\mathbb F}^{I_C} \mapsto {\mathbb F}^{I_0}.
Here, C and I_0, I_1, \ldots, I_C are positive integers, and (C+1) is the number of
modes of a tensor (also known as the number of ways of a multi-way array). The dimensionality of mode c is I_c, for 0\le c\le C. In statistics and machine learning, an image is vectorized when viewed as a single observation, and a collection of vectorized images is organized as a "data tensor". For example, a set of facial images \{{\mathbf d}_{i_p,i_e,i_l,i_v}\in {\mathbb R}^{I_X}\} with I_X pixels that are the consequences of multiple causal factors, such as a facial geometry i_p (1\le i_p\le I_P), an expression i_e (1\le i_e\le I_E), an illumination condition i_l (1\le i_l\le I_L), and a viewing condition i_v (1\le i_v\le I_V), may be organized into a data tensor (i.e. multi-way array) {\mathcal D}\in {\mathbb R}^{I_X\times I_P \times I_E\times I_L \times I_V}, where I_P is the total number of facial geometries, I_E the total number of expressions, I_L the total number of illumination conditions, and I_V the total number of viewing conditions. Tensor factorization methods such as TensorFaces and multilinear (tensor) independent component analysis factorize the data tensor into a set of vector spaces that span the causal factor representations, where an image is the result of a tensor transformation {\mathcal T} that maps a set of causal factor representations to the pixel space. Another approach to using tensors in machine learning is to embed various data types directly. For example, a
grayscale image, commonly represented as a discrete 2-way array {\mathbf D}\in {\mathbb R}^{I_{RX}\times I_{CX}}, has dimensionality I_{RX} \times I_{CX}, where I_{RX} is the number of rows and I_{CX} is the number of columns. When an image is treated as a 2-way array or 2nd-order tensor (i.e. as a collection of column/row observations), tensor factorization methods compute the image column space, the image row space, and the normalized PCA coefficients or the ICA coefficients. Similarly, a color image with RGB channels, \mathcal{D}\in \mathbb{R}^{N \times M \times 3}, may be viewed as a 3rd-order data tensor or 3-way array. In natural language processing, a word might be expressed as a vector v via the
Word2vec algorithm. Thus v becomes a mode-1 tensor:
: v \mapsto \mathcal{A}\in \mathbb{R}^N.
The embedding of subject-object-verb semantics requires embedding relationships among three words. Because a word is itself a vector, subject-object-verb semantics could be expressed using mode-3 tensors:
: v_a \times v_b \times v_c \mapsto \mathcal{A}\in \mathbb{R}^{N \times N \times N}.
In practice the neural network designer is primarily concerned with the specification of embeddings, the connection of tensor layers, and the operations performed on them in a network. Modern machine learning frameworks manage the optimization, tensor factorization and backpropagation automatically.
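The order-3 construction above can be sketched directly in NumPy; the vectors below are random stand-ins rather than trained Word2vec embeddings, and the embedding dimensionality is chosen arbitrarily:

<syntaxhighlight lang="python">
import numpy as np

N = 4  # embedding dimensionality (illustrative)
# Stand-in word vectors for a subject, an object and a verb.
v_a, v_b, v_c = np.random.rand(N), np.random.rand(N), np.random.rand(N)

# The outer product of the three vectors yields a mode-3 tensor in R^{N x N x N}.
A = np.einsum('i,j,k->ijk', v_a, v_b, v_c)
print(A.shape)  # (4, 4, 4)
</syntaxhighlight>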

===As unit values===
Tensors may be used as the unit values of neural networks, extending the concept of scalar, vector and matrix values to multiple dimensions. The output value of a single-layer unit y_m is the sum-product of its input units and the connection weights, filtered through the
activation function f:
: y_m = f\left(\sum_n x_n u_{m,n}\right),
where y_m \in \mathbb{R}. If each output value y_m is a scalar, then we have the classical definition of an
artificial neural network. By replacing each unit component with a tensor, the network is able to express higher-dimensional data such as images or videos:
: y_m \in \mathbb{R}^{I_0 \times I_1 \times \ldots \times I_C}.
This use of tensors to replace unit values is common in
convolutional neural networks, where each unit might be an image processed through multiple layers. By embedding the data in tensors, such network structures enable the learning of complex data types.
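A minimal sketch of tensor-valued units, assuming each of the N input units holds a small grayscale image rather than a scalar; the layer sizes and the ReLU activation are illustrative choices:

<syntaxhighlight lang="python">
import numpy as np

N, M = 3, 2   # number of input and output units (illustrative)
H, W = 8, 8   # each unit value is an H x W image instead of a scalar
x = np.random.rand(N, H, W)   # tensor-valued input units
u = np.random.rand(M, N)      # scalar connection weights

# y_m = f(sum_n x_n u_{m,n}), applied elementwise to each image (f = ReLU here).
y = np.maximum(0.0, np.einsum('mn,nhw->mhw', u, x))
print(y.shape)  # (2, 8, 8)
</syntaxhighlight>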

===In fully connected layers===
Tensors may also be used to compute the layers of a fully connected neural network, where the tensor is applied to the entire layer instead of to individual unit values. The output value of a single-layer unit y_m is the sum-product of its input units and the connection weights, filtered through the
activation function f:
: y_m = f\left(\sum_n x_n u_{m,n}\right).
The input vector x and output vector y can be expressed as mode-1 tensors, while the hidden weights can be expressed as a mode-2 tensor. In this example the unit values are scalars while the tensors take on the dimensions of the network layers:
: x_n \mapsto \mathcal{X}\in \mathbb{R}^{1 \times N},
: y_m \mapsto \mathcal{Y}\in \mathbb{R}^{1 \times M},
: u_{m,n} \mapsto \mathcal{U}\in \mathbb{R}^{N \times M}.
In this notation, the output values can be computed as a tensor product of the input and weight tensors:
: \mathcal{Y} = f ( \mathcal{X} \mathcal{U} ),
which computes the sum-product as a tensor multiplication (similar to matrix multiplication). This formulation enables the entire layer of a fully connected network to be computed efficiently by mapping the units and weights to tensors.
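A fully connected layer in this form can be sketched as follows; the layer widths are arbitrary and a sigmoid is used as one possible choice for the activation f:

<syntaxhighlight lang="python">
import numpy as np

N, M = 4, 3                # layer widths (illustrative)
X = np.random.rand(1, N)   # input layer as a 1 x N tensor
U = np.random.rand(N, M)   # hidden weights as an N x M tensor

def f(z):
    return 1.0 / (1.0 + np.exp(-z))  # sigmoid activation (one possible choice for f)

# Y = f(XU): the sum-product over n becomes a tensor (matrix) multiplication.
Y = f(X @ U)
print(Y.shape)  # (1, 3)
</syntaxhighlight>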

===In convolutional layers===
A different reformulation of neural networks allows tensors to express the convolution layers of a neural network. A convolutional layer has multiple inputs, each of which is a spatial structure such as an image or volume. The inputs are convolved by
filtering before being passed to the next layer. A typical use is to perform feature detection or isolation in image recognition.
Convolution is often computed as the multiplication of an input signal g with a filter kernel f. In two dimensions the discrete, finite form is:
: (f*g)_{x,y} = \sum_{j=-w}^w \sum_{k=-w}^w f_{j,k}\, g_{x+j,y+k},
where w is the half-width of the kernel. This definition can be rephrased as a matrix-vector product in terms of tensors that express the kernel transform, the data transform and the inverse transform:
: \mathcal{Y} = \mathcal{A}[(\mathcal{C}g) \odot (\mathcal{B}d)],
where \mathcal{A}, \mathcal{B} and \mathcal{C} are the inverse transform, the data transform and the kernel transform, respectively. The derivation is more complex when the filtering kernel also includes a non-linear activation function such as
sigmoid or
ReLU. The hidden weights of the convolution layer are the parameters to the filter. These can be reduced with a
pooling layer which reduces the resolution (size) of the data, and can also be expressed as a tensor operation.
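The discrete two-dimensional convolution sum above can be evaluated directly as follows (a naive sketch, without the transform-based reformulation; the input size, kernel and border handling are illustrative):

<syntaxhighlight lang="python">
import numpy as np

def conv2d_direct(g, f, w):
    """Evaluate (f*g)_{x,y} = sum_{j,k=-w..w} f[j,k] g[x+j, y+k].

    g is a 2-D input signal, f a (2w+1) x (2w+1) kernel; only interior
    points where the kernel fits entirely inside g are computed.
    """
    H, W = g.shape
    out = np.zeros((H - 2 * w, W - 2 * w))
    for x in range(w, H - w):
        for y in range(w, W - w):
            s = 0.0
            for j in range(-w, w + 1):
                for k in range(-w, w + 1):
                    s += f[j + w, k + w] * g[x + j, y + k]
            out[x - w, y - w] = s
    return out

g = np.random.rand(8, 8)               # input image (illustrative)
f = np.ones((3, 3)) / 9.0              # 3 x 3 averaging kernel, so w = 1
print(conv2d_direct(g, f, w=1).shape)  # (6, 6)
</syntaxhighlight>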

==Tensor factorization==
An important contribution of tensors in machine learning is the ability to
factorize tensors in order to decompose data into constituent factors or to reduce the number of learned parameters. Data tensor modeling techniques stem from the linear tensor decomposition (CANDECOMP/PARAFAC) and the multilinear tensor decomposition (Tucker).

===Tucker decomposition===
Tucker decomposition, for example, takes a 3-way array \mathcal{X} \in \mathbb{R}^{I \times J \times K} and decomposes the tensor into three matrices \mathcal{A,B,C} and a smaller tensor \mathcal{G}. The shapes of the matrices and the new tensor are such that the total number of elements is reduced. The new tensors have shapes
: \mathcal{A} \in \mathbb{R}^{I \times P},
: \mathcal{B} \in \mathbb{R}^{J \times Q},
: \mathcal{C} \in \mathbb{R}^{K \times R},
: \mathcal{G} \in \mathbb{R}^{P \times Q \times R}.
Then the original tensor can be expressed as the mode-n product of the core tensor \mathcal{G} with the three factor matrices:
: \mathcal{X} = \mathcal{G} \times_1 \mathcal{A} \times_2 \mathcal{B} \times_3 \mathcal{C}.
In the example shown in the figure, the dimensions of the tensors are
: \mathcal{X}: I=8, J=6, K=3;\ \mathcal{A}: I=8, P=5;\ \mathcal{B}: J=6, Q=4;\ \mathcal{C}: K=3, R=2;\ \mathcal{G}: P=5, Q=4, R=2.
The total number of elements in the Tucker factorization is
: |\mathcal{A}|+|\mathcal{B}|+|\mathcal{C}|+|\mathcal{G}| = (I \times P) + (J \times Q) + (K \times R) + (P \times Q \times R) = 8\times5 + 6\times4 + 3\times2 + 5\times4\times2 = 110.
The number of elements in the original \mathcal{X} is 144, resulting in a
data reduction from 144 down to 110 elements, a reduction of about 24% in parameters or data size. For much larger initial tensors, and depending on the rank (redundancy) of the tensor, the gains can be more significant. The work of Rabanser et al. provides an introduction to tensors with more details on the extension of Tucker decomposition to N dimensions beyond the mode-3 example given here.
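The dimensions in the example can be reproduced with a short NumPy sketch: random factor matrices and a random core tensor (purely illustrative data) are combined through the three mode-n products, and the element counts match the 110-versus-144 comparison above:

<syntaxhighlight lang="python">
import numpy as np

I, J, K = 8, 6, 3   # dimensions of the original tensor
P, Q, R = 5, 4, 2   # reduced (core) dimensions

G = np.random.rand(P, Q, R)   # core tensor
A = np.random.rand(I, P)      # mode-1 factor matrix
B = np.random.rand(J, Q)      # mode-2 factor matrix
C = np.random.rand(K, R)      # mode-3 factor matrix

# X = G x_1 A x_2 B x_3 C, written as a single contraction over the core indices.
X = np.einsum('pqr,ip,jq,kr->ijk', G, A, B, C)

print(X.shape)                            # (8, 6, 3)
print(G.size + A.size + B.size + C.size)  # 110 stored elements
print(X.size)                             # 144 elements in the full tensor
</syntaxhighlight>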

===Tensor trains===
Another technique for decomposing tensors rewrites the initial tensor as a sequence (train) of smaller tensors. A tensor train (TT) is a sequence of tensors of reduced rank, called
canonical factors. The original tensor can be expressed as the sum-product of the sequence:
: \mathcal{X} = \mathcal{G}_1 \mathcal{G}_2 \mathcal{G}_3 \cdots \mathcal{G}_d.
Introducing the method in 2011, Ivan Oseledets observes that Tucker decomposition is "suitable for small dimensions, especially for the three-dimensional case. For large
d it is not suitable." Thus tensor trains can be used to factorize larger tensors in higher dimensions.
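A minimal sketch of the idea for a 3-way tensor, with illustrative dimensions and TT ranks: each core \mathcal{G}_i is itself a small 3-way tensor, and contracting the cores in sequence recovers the full tensor from far fewer stored elements:

<syntaxhighlight lang="python">
import numpy as np

I1, I2, I3 = 8, 6, 3   # dimensions of the full tensor (illustrative)
r1, r2 = 2, 2          # TT ranks (illustrative)

# Tensor-train cores; the boundary cores have rank 1 on their outer side.
G1 = np.random.rand(1, I1, r1)
G2 = np.random.rand(r1, I2, r2)
G3 = np.random.rand(r2, I3, 1)

# X[i, j, k] is the product of the slices G1[:, i, :] G2[:, j, :] G3[:, k, :].
X = np.einsum('aib,bjc,ckd->ijk', G1, G2, G3)

print(X.shape)                      # (8, 6, 3)
print(G1.size + G2.size + G3.size)  # 46 stored elements versus 144
</syntaxhighlight>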

===Tensor graphs===
The unified data architecture and automatic differentiation of tensors have enabled higher-level designs of machine learning in the form of tensor graphs. This leads to new architectures, such as tensor-graph convolutional networks (TGCN), which identify highly non-linear associations in data, combine multiple relations, and scale gracefully, while remaining robust and performant. These developments are impacting all areas of machine learning, such as
text mining and clustering, time-varying data, and neural networks wherein the input data is a social graph and the data changes dynamically.

==Hardware==