RBF networks are typically trained from pairs of input and target values \mathbf{x}(t), y(t), t = 1, \dots, T, by a two-step algorithm.

In the first step, the center vectors \mathbf c_i of the RBF functions in the hidden layer are chosen. This step can be performed in several ways; centers can be randomly sampled from some set of examples, or they can be determined using k-means clustering. Note that this step is unsupervised.

The second step simply fits a linear model with coefficients w_i to the hidden layer's outputs with respect to some objective function. A common objective function, at least for regression/function estimation, is the least squares function:

: K( \mathbf{w} ) \ \stackrel{\mathrm{def}}{=}\ \sum_{t=1}^T K_t( \mathbf{w} )

where

: K_t( \mathbf{w} ) \ \stackrel{\mathrm{def}}{=}\ \big [ y(t) - \varphi \big ( \mathbf{x}(t), \mathbf{w} \big ) \big ]^2 .

We have explicitly included the dependence on the weights. Minimization of the least squares objective function by optimal choice of weights optimizes accuracy of fit.

There are occasions in which multiple objectives, such as smoothness as well as accuracy, must be optimized. In that case it is useful to optimize a regularized objective function such as

: H( \mathbf{w} ) \ \stackrel{\mathrm{def}}{=}\ K( \mathbf{w} ) + \lambda S( \mathbf{w} ) \ \stackrel{\mathrm{def}}{=}\ \sum_{t=1}^T H_t( \mathbf{w} )

where

: S( \mathbf{w} ) \ \stackrel{\mathrm{def}}{=}\ \sum_{t=1}^T S_t( \mathbf{w} )

and

: H_t( \mathbf{w} ) \ \stackrel{\mathrm{def}}{=}\ K_t ( \mathbf{w} ) + \lambda S_t ( \mathbf{w} ) ,

where optimization of S maximizes smoothness and \lambda is known as a regularization parameter.

A third optional backpropagation step can be performed to fine-tune all of the RBF net's parameters.
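A minimal sketch of the two-step procedure, assuming Gaussian basis functions \rho(r) = \exp(-r^2/(2\sigma^2)) and a simple ridge penalty standing in for the smoothness term S(\mathbf{w}); the function and variable names here are illustrative, not from any particular library:

<syntaxhighlight lang="python">
import numpy as np

def fit_rbf_weights(X, y, centers, sigma, lam=0.0):
    """Step 2: fit the linear weights w by (regularized) least squares,
    given centers chosen in step 1 (e.g. randomly sampled or via k-means)."""
    # Hidden-layer outputs: G[t, i] = rho(||x(t) - c_i||)
    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    G = np.exp(-dists**2 / (2.0 * sigma**2))
    # Minimize K(w) = sum_t [y(t) - (G w)_t]^2 plus lam * ||w||^2,
    # a common ridge-style choice of S(w) when lam > 0.
    A = G.T @ G + lam * np.eye(G.shape[1])
    return np.linalg.solve(A, G.T @ y)

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 2 * np.pi, size=(50, 1))            # T = 50 training inputs
y = np.sin(X[:, 0])                                      # targets y(t)
centers = X[rng.choice(len(X), size=10, replace=False)]  # step 1: random sampling
w = fit_rbf_weights(X, y, centers, sigma=1.0, lam=1e-6)
</syntaxhighlight>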
===Interpolation===

RBF networks can be used to interpolate a function y: \mathbb{R}^n \to \mathbb{R} when the values of that function are known on a finite number of points: y(\mathbf x_i) = b_i, i=1, \ldots, N. Taking the known points \mathbf x_i to be the centers of the radial basis functions and evaluating the values of the basis functions at the same points, g_{ij} = \rho(|| \mathbf x_j - \mathbf x_i ||), the weights can be solved from the equation

: \left[ \begin{matrix} g_{11} & g_{12} & \cdots & g_{1N} \\ g_{21} & g_{22} & \cdots & g_{2N} \\ \vdots & & \ddots & \vdots \\ g_{N1} & g_{N2} & \cdots & g_{NN} \end{matrix}\right] \left[ \begin{matrix} w_1 \\ w_2 \\ \vdots \\ w_N \end{matrix} \right] = \left[ \begin{matrix} b_1 \\ b_2 \\ \vdots \\ b_N \end{matrix} \right]

It can be shown that the interpolation matrix in the above equation is non-singular if the points \mathbf x_i are distinct, and thus the weights w can be solved by simple linear algebra:

: \mathbf{w} = \mathbf{G}^{-1} \mathbf{b},

where G = (g_{ij}).
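As an illustration, a short sketch of exact interpolation, assuming the Gaussian kernel for \rho (any radial function yielding a non-singular interpolation matrix would do):

<syntaxhighlight lang="python">
import numpy as np

def rbf_interpolation_weights(X, b, sigma=1.0):
    # Interpolation matrix: g_ij = rho(||x_j - x_i||), with the known
    # points doubling as the centers.
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    G = np.exp(-dists**2 / (2.0 * sigma**2))
    # G is non-singular for distinct points, so solve G w = b directly
    # (numerically preferable to forming G^{-1} explicitly).
    return np.linalg.solve(G, b)

X = np.array([[0.0], [1.0], [2.0], [3.0]])  # distinct known points x_i
b = np.array([0.0, 1.0, 0.0, -1.0])         # known values b_i = y(x_i)
w = rbf_interpolation_weights(X, b)
</syntaxhighlight>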
===Function approximation===

If the purpose is not to perform strict interpolation but instead more general function approximation or classification, the optimization is somewhat more complex because there is no obvious choice for the centers. The training is typically done in two phases, first fixing the width and centers and then the weights. This can be justified by considering the different nature of the non-linear hidden neurons versus the linear output neuron.
====Training the basis function centers====

Basis function centers can be randomly sampled among the input instances, obtained by the Orthogonal Least Square Learning Algorithm, or found by clustering the samples and choosing the cluster means as the centers. The RBF widths are usually all fixed to the same value, which is proportional to the maximum distance between the chosen centers.
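A sketch of these two heuristics, assuming SciPy's kmeans2 for the clustering step; the scaling d_\max/\sqrt{2N} used below is one common choice of proportionality constant, not the only one:

<syntaxhighlight lang="python">
import numpy as np
from scipy.cluster.vq import kmeans2

def choose_centers_and_width(X, n_centers, seed=0):
    # Cluster the samples and take the cluster means as centers.
    centers, _ = kmeans2(X, n_centers, minit='++', seed=seed)
    # Shared width proportional to the maximum inter-center distance.
    diffs = centers[:, None, :] - centers[None, :, :]
    d_max = np.linalg.norm(diffs, axis=2).max()
    sigma = d_max / np.sqrt(2.0 * n_centers)
    return centers, sigma
</syntaxhighlight>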
====Pseudoinverse solution for the linear weights====

After the centers c_i have been fixed, the weights that minimize the error at the output can be computed with a linear pseudoinverse solution:

: \mathbf{w} = \mathbf{G}^+ \mathbf{b},

where the entries of G are the values of the radial basis functions evaluated at the points x_i: g_{ji} = \rho(||x_j-c_i||). The existence of this linear solution means that unlike multi-layer perceptron (MLP) networks, RBF networks have an explicit minimizer (when the centers are fixed).
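In NumPy, a sketch under the same Gaussian-basis assumption, where np.linalg.pinv computes the Moore–Penrose pseudoinverse \mathbf{G}^+:

<syntaxhighlight lang="python">
import numpy as np

def pseudoinverse_weights(X, b, centers, sigma):
    # Design matrix with fixed centers: g_ji = rho(||x_j - c_i||)
    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    G = np.exp(-dists**2 / (2.0 * sigma**2))
    # w = G^+ b: the minimum-norm least-squares solution.
    return np.linalg.pinv(G) @ b
</syntaxhighlight>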
====Gradient descent training of the linear weights====

Another possible training algorithm is gradient descent. In gradient descent training, the weights are adjusted at each time step by moving them in a direction opposite to the gradient of the objective function (thus allowing the minimum of the objective function to be found):

: \mathbf{w}(t+1) = \mathbf{w}(t) - \nu \frac {d} {d\mathbf{w}} H_t(\mathbf{w}),

where \nu is a "learning parameter."

For the case of training the linear weights, a_i, the algorithm becomes

: a_i (t+1) = a_i(t) + \nu \big [ y(t) - \varphi \big ( \mathbf{x}(t), \mathbf{w} \big ) \big ] \rho \big ( \left \Vert \mathbf{x}(t) - \mathbf{c}_i \right \Vert \big )

in the unnormalized case and

: a_i (t+1) = a_i(t) + \nu \big [ y(t) - \varphi \big ( \mathbf{x}(t), \mathbf{w} \big ) \big ] u \big ( \left \Vert \mathbf{x}(t) - \mathbf{c}_i \right \Vert \big )

in the normalized case.

For local-linear architectures, gradient-descent training is

: e_{ij} (t+1) = e_{ij}(t) + \nu \big [ y(t) - \varphi \big ( \mathbf{x}(t), \mathbf{w} \big ) \big ] v_{ij} \big ( \mathbf{x}(t) - \mathbf{c}_i \big )
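A single stochastic update for the unnormalized linear weights, following the rule above and again assuming a Gaussian \rho:

<syntaxhighlight lang="python">
import numpy as np

def gradient_step(a, x_t, y_t, centers, sigma, nu):
    # Basis activations rho(||x(t) - c_i||) for the unnormalized network;
    # the normalized case would use u = basis / basis.sum() instead.
    basis = np.exp(-np.linalg.norm(x_t - centers, axis=1)**2
                   / (2.0 * sigma**2))
    error = y_t - a @ basis        # y(t) - phi(x(t), w)
    return a + nu * error * basis  # unnormalized-case update
</syntaxhighlight>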
====Projection operator training of the linear weights====

For the case of training the linear weights, a_i and e_{ij}, the algorithm becomes

: a_i (t+1) = a_i(t) + \nu \big [ y(t) - \varphi \big ( \mathbf{x}(t), \mathbf{w} \big ) \big ] \frac {\rho \big ( \left \Vert \mathbf{x}(t) - \mathbf{c}_i \right \Vert \big )} {\sum_{i=1}^N \rho^2 \big ( \left \Vert \mathbf{x}(t) - \mathbf{c}_i \right \Vert \big )}

in the unnormalized case,

: a_i (t+1) = a_i(t) + \nu \big [ y(t) - \varphi \big ( \mathbf{x}(t), \mathbf{w} \big ) \big ] \frac {u \big ( \left \Vert \mathbf{x}(t) - \mathbf{c}_i \right \Vert \big )} {\sum_{i=1}^N u^2 \big ( \left \Vert \mathbf{x}(t) - \mathbf{c}_i \right \Vert \big )}

in the normalized case, and

: e_{ij} (t+1) = e_{ij}(t) + \nu \big [ y(t) - \varphi \big ( \mathbf{x}(t), \mathbf{w} \big ) \big ] \frac { v_{ij} \big ( \mathbf{x}(t) - \mathbf{c}_i \big ) } {\sum_{i=1}^N \sum_{j=1}^n v_{ij}^2 \big ( \mathbf{x}(t) - \mathbf{c}_i \big ) }

in the local-linear case.

For one basis function, projection operator training reduces to Newton's method.
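A sketch of the unnormalized projection-operator step with a Gaussian \rho; the division by the squared norm of the activation vector is what makes the step a projection, so that with \nu = 1 the error on the current example is driven exactly to zero:

<syntaxhighlight lang="python">
import numpy as np

def projection_step(a, x_t, y_t, centers, sigma, nu=1.0):
    basis = np.exp(-np.linalg.norm(x_t - centers, axis=1)**2
                   / (2.0 * sigma**2))      # rho(||x(t) - c_i||)
    error = y_t - a @ basis                 # y(t) - phi(x(t), w)
    # Scaling by sum(basis**2) projects w onto the hyperplane of weight
    # vectors that fit the current example exactly (when nu = 1).
    return a + nu * error * basis / np.sum(basis**2)
</syntaxhighlight>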
==Examples==