RBF networks are typically trained from pairs of input and target values \mathbf{x}(t), y(t), t = 1, \dots, T, by a two-step algorithm.

In the first step, the center vectors \mathbf c_i of the RBF functions in the hidden layer are chosen. This step can be performed in several ways; centers can be randomly sampled from some set of examples, or they can be determined using k-means clustering. Note that this step is unsupervised.

The second step simply fits a linear model with coefficients w_i to the hidden layer's outputs with respect to some objective function. A common objective function, at least for regression/function estimation, is the least squares function:

: K( \mathbf{w} ) \ \stackrel{\mathrm{def}}{=}\ \sum_{t=1}^T K_t( \mathbf{w} )

where

: K_t( \mathbf{w} ) \ \stackrel{\mathrm{def}}{=}\ \big [ y(t) - \varphi \big ( \mathbf{x}(t), \mathbf{w} \big ) \big ]^2 .

We have explicitly included the dependence on the weights. Minimization of the least squares objective function by optimal choice of weights optimizes accuracy of fit.

There are occasions in which multiple objectives, such as smoothness as well as accuracy, must be optimized. In that case it is useful to optimize a regularized objective function such as

: H( \mathbf{w} ) \ \stackrel{\mathrm{def}}{=}\ K( \mathbf{w} ) + \lambda S( \mathbf{w} ) \ \stackrel{\mathrm{def}}{=}\ \sum_{t=1}^T H_t( \mathbf{w} )

where

: S( \mathbf{w} ) \ \stackrel{\mathrm{def}}{=}\ \sum_{t=1}^T S_t( \mathbf{w} )

and

: H_t( \mathbf{w} ) \ \stackrel{\mathrm{def}}{=}\ K_t ( \mathbf{w} ) + \lambda S_t ( \mathbf{w} ) ,

where optimization of S maximizes smoothness and \lambda is known as a regularization parameter.

A third optional backpropagation step can be performed to fine-tune all of the RBF net's parameters.
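A minimal sketch of the two-step procedure, assuming Gaussian basis functions \rho(r) = \exp(-r^2/(2\sigma^2)) and a simple ridge penalty standing in for the smoothness term S(\mathbf{w}); the function and variable names here are illustrative, not from any particular library:

<syntaxhighlight lang="python">
import numpy as np

def fit_rbf_weights(X, y, centers, sigma, lam=0.0):
    """Step 2: fit the linear weights w by (regularized) least squares,
    given centers chosen in step 1 (e.g. randomly sampled or via k-means)."""
    # Hidden-layer outputs: G[t, i] = rho(||x(t) - c_i||)
    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    G = np.exp(-dists**2 / (2.0 * sigma**2))
    # Minimize K(w) = sum_t [y(t) - (G w)_t]^2 plus lam * ||w||^2,
    # a common ridge-style choice of S(w) when lam > 0.
    A = G.T @ G + lam * np.eye(G.shape[1])
    return np.linalg.solve(A, G.T @ y)

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 2 * np.pi, size=(50, 1))            # T = 50 training inputs
y = np.sin(X[:, 0])                                      # targets y(t)
centers = X[rng.choice(len(X), size=10, replace=False)]  # step 1: random sampling
w = fit_rbf_weights(X, y, centers, sigma=1.0, lam=1e-6)
</syntaxhighlight>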
===Interpolation===

RBF networks can be used to interpolate a function y: \mathbb{R}^n \to \mathbb{R} when the values of that function are known on a finite number of points: y(\mathbf x_i) = b_i, i=1, \ldots, N. Taking the known points \mathbf x_i to be the centers of the radial basis functions and evaluating the values of the basis functions at the same points, g_{ij} = \rho(|| \mathbf x_j - \mathbf x_i ||), the weights can be solved from the equation

: \left[ \begin{matrix} g_{11} & g_{12} & \cdots & g_{1N} \\ g_{21} & g_{22} & \cdots & g_{2N} \\ \vdots & & \ddots & \vdots \\ g_{N1} & g_{N2} & \cdots & g_{NN} \end{matrix}\right] \left[ \begin{matrix} w_1 \\ w_2 \\ \vdots \\ w_N \end{matrix} \right] = \left[ \begin{matrix} b_1 \\ b_2 \\ \vdots \\ b_N \end{matrix} \right]

It can be shown that the interpolation matrix in the above equation is non-singular if the points \mathbf x_i are distinct, and thus the weights w can be solved by simple linear algebra:

: \mathbf{w} = \mathbf{G}^{-1} \mathbf{b},

where G = (g_{ij}).
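As an illustration, a short sketch of exact interpolation, assuming the Gaussian kernel for \rho (any radial function yielding a non-singular interpolation matrix would do):

<syntaxhighlight lang="python">
import numpy as np

def rbf_interpolation_weights(X, b, sigma=1.0):
    # Interpolation matrix: g_ij = rho(||x_j - x_i||), with the known
    # points doubling as the centers.
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    G = np.exp(-dists**2 / (2.0 * sigma**2))
    # G is non-singular for distinct points, so solve G w = b directly
    # (numerically preferable to forming G^{-1} explicitly).
    return np.linalg.solve(G, b)

X = np.array([[0.0], [1.0], [2.0], [3.0]])  # distinct known points x_i
b = np.array([0.0, 1.0, 0.0, -1.0])         # known values b_i = y(x_i)
w = rbf_interpolation_weights(X, b)
</syntaxhighlight>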
===Function approximation===

If the purpose is not to perform strict interpolation but instead more general function approximation or classification, the optimization is somewhat more complex because there is no obvious choice for the centers. The training is typically done in two phases, first fixing the width and centers and then the weights. This can be justified by considering the different nature of the non-linear hidden neurons versus the linear output neuron.
====Training the basis function centers====

Basis function centers can be randomly sampled among the input instances, obtained by the Orthogonal Least Square Learning Algorithm, or found by clustering the samples and choosing the cluster means as the centers. The RBF widths are usually all fixed to the same value, which is proportional to the maximum distance between the chosen centers.
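A sketch of these two heuristics, assuming SciPy's kmeans2 for the clustering step; the scaling d_\max/\sqrt{2N} used below is one common choice of proportionality constant, not the only one:

<syntaxhighlight lang="python">
import numpy as np
from scipy.cluster.vq import kmeans2

def choose_centers_and_width(X, n_centers, seed=0):
    # Cluster the samples and take the cluster means as centers.
    centers, _ = kmeans2(X, n_centers, minit='++', seed=seed)
    # Shared width proportional to the maximum inter-center distance.
    diffs = centers[:, None, :] - centers[None, :, :]
    d_max = np.linalg.norm(diffs, axis=2).max()
    sigma = d_max / np.sqrt(2.0 * n_centers)
    return centers, sigma
</syntaxhighlight>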
====Pseudoinverse solution for the linear weights====

After the centers c_i have been fixed, the weights that minimize the error at the output can be computed with a linear pseudoinverse solution:

: \mathbf{w} = \mathbf{G}^+ \mathbf{b},

where the entries of G are the values of the radial basis functions evaluated at the points x_i: g_{ji} = \rho(||x_j-c_i||). The existence of this linear solution means that unlike multi-layer perceptron (MLP) networks, RBF networks have an explicit minimizer (when the centers are fixed).
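In NumPy, a sketch under the same Gaussian-basis assumption, where np.linalg.pinv computes the Moore–Penrose pseudoinverse \mathbf{G}^+:

<syntaxhighlight lang="python">
import numpy as np

def pseudoinverse_weights(X, b, centers, sigma):
    # Design matrix with fixed centers: g_ji = rho(||x_j - c_i||)
    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    G = np.exp(-dists**2 / (2.0 * sigma**2))
    # w = G^+ b: the minimum-norm least-squares solution.
    return np.linalg.pinv(G) @ b
</syntaxhighlight>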
====Gradient descent training of the linear weights====

Another possible training algorithm is gradient descent. In gradient descent training, the weights are adjusted at each time step by moving them in a direction opposite to the gradient of the objective function (thus allowing the minimum of the objective function to be found):

: \mathbf{w}(t+1) = \mathbf{w}(t) - \nu \frac {d} {d\mathbf{w}} H_t(\mathbf{w}),

where \nu is a "learning parameter."

For the case of training the linear weights, a_i, the algorithm becomes

: a_i (t+1) = a_i(t) + \nu \big [ y(t) - \varphi \big ( \mathbf{x}(t), \mathbf{w} \big ) \big ] \rho \big ( \left \Vert \mathbf{x}(t) - \mathbf{c}_i \right \Vert \big )

in the unnormalized case and

: a_i (t+1) = a_i(t) + \nu \big [ y(t) - \varphi \big ( \mathbf{x}(t), \mathbf{w} \big ) \big ] u \big ( \left \Vert \mathbf{x}(t) - \mathbf{c}_i \right \Vert \big )

in the normalized case.

For local-linear architectures, gradient-descent training is

: e_{ij} (t+1) = e_{ij}(t) + \nu \big [ y(t) - \varphi \big ( \mathbf{x}(t), \mathbf{w} \big ) \big ] v_{ij} \big ( \mathbf{x}(t) - \mathbf{c}_i \big )
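A single stochastic update for the unnormalized linear weights, following the rule above and again assuming a Gaussian \rho:

<syntaxhighlight lang="python">
import numpy as np

def gradient_step(a, x_t, y_t, centers, sigma, nu):
    # Basis activations rho(||x(t) - c_i||) for the unnormalized network;
    # the normalized case would use u = basis / basis.sum() instead.
    basis = np.exp(-np.linalg.norm(x_t - centers, axis=1)**2
                   / (2.0 * sigma**2))
    error = y_t - a @ basis        # y(t) - phi(x(t), w)
    return a + nu * error * basis  # unnormalized-case update
</syntaxhighlight>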
====Projection operator training of the linear weights====

For the case of training the linear weights, a_i and e_{ij}, the algorithm becomes

: a_i (t+1) = a_i(t) + \nu \big [ y(t) - \varphi \big ( \mathbf{x}(t), \mathbf{w} \big ) \big ] \frac {\rho \big ( \left \Vert \mathbf{x}(t) - \mathbf{c}_i \right \Vert \big )} {\sum_{i=1}^N \rho^2 \big ( \left \Vert \mathbf{x}(t) - \mathbf{c}_i \right \Vert \big )}

in the unnormalized case,

: a_i (t+1) = a_i(t) + \nu \big [ y(t) - \varphi \big ( \mathbf{x}(t), \mathbf{w} \big ) \big ] \frac {u \big ( \left \Vert \mathbf{x}(t) - \mathbf{c}_i \right \Vert \big )} {\sum_{i=1}^N u^2 \big ( \left \Vert \mathbf{x}(t) - \mathbf{c}_i \right \Vert \big )}

in the normalized case, and

: e_{ij} (t+1) = e_{ij}(t) + \nu \big [ y(t) - \varphi \big ( \mathbf{x}(t), \mathbf{w} \big ) \big ] \frac { v_{ij} \big ( \mathbf{x}(t) - \mathbf{c}_i \big ) } {\sum_{i=1}^N \sum_{j=1}^n v_{ij}^2 \big ( \mathbf{x}(t) - \mathbf{c}_i \big ) }

in the local-linear case.

For one basis function, projection operator training reduces to Newton's method.
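A sketch of the unnormalized projection-operator step with a Gaussian \rho; the division by the squared norm of the activation vector is what makes the step a projection, so that with \nu = 1 the error on the current example is driven exactly to zero:

<syntaxhighlight lang="python">
import numpy as np

def projection_step(a, x_t, y_t, centers, sigma, nu=1.0):
    basis = np.exp(-np.linalg.norm(x_t - centers, axis=1)**2
                   / (2.0 * sigma**2))      # rho(||x(t) - c_i||)
    error = y_t - a @ basis                 # y(t) - phi(x(t), w)
    # Scaling by sum(basis**2) projects w onto the hyperplane of weight
    # vectors that fit the current example exactly (when nu = 1).
    return a + nu * error * basis / np.sum(basis**2)
</syntaxhighlight>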
==Examples==