The use of the exponential window function is first attributed to Poisson as an extension of a numerical analysis technique from the 17th century, and it was later adopted by the signal processing community in the 1940s. Here, exponential smoothing is the application of the exponential, or Poisson, window function. Exponential smoothing was suggested in the statistical literature, without citation to previous work, by Robert Goodell Brown in 1956 and expanded by Charles C. Holt in 1957. The formulation below, which is the one commonly used, is attributed to Brown and is known as "Brown’s simple exponential smoothing". All the methods of Holt, Winters, and Brown may be seen as a simple application of recursive filtering, first found in the 1940s.

Unlike some other smoothing methods, such as the simple moving average, this technique does not require any minimum number of observations to be made before it begins to produce results. In practice, however, a "good average" will not be achieved until several samples have been averaged together; for example, a constant signal will take approximately 3 / \alpha stages to reach 95% of the actual value. To accurately reconstruct the original signal without information loss, all stages of the exponential moving average must also be available, because older samples decay in weight exponentially. This is in contrast to a simple moving average, in which some samples can be skipped without as much loss of information because the samples within the average are weighted equally. If a known number of samples will be missed, one can adjust a weighted average for this as well, by giving equal weight to the new sample and all those to be skipped.

This simple form of exponential smoothing is also known as an exponentially weighted moving average (EWMA). Technically it can also be classified as an autoregressive integrated moving average (ARIMA) (0,1,1) model with no constant term.
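The recursion and the "approximately 3 / \alpha stages" rule of thumb can be illustrated with a minimal Python sketch. The function names are hypothetical, and the step-response helper assumes the smoothed value starts at 0 before a constant input of 1 is applied; it is an illustration, not a reference implementation.

```python
def exponential_smoothing(xs, alpha):
    """Simple exponential smoothing: s_0 = x_0, s_t = alpha*x_t + (1 - alpha)*s_{t-1}.

    Hypothetical helper name; alpha is the smoothing factor in (0, 1].
    """
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    smoothed = []
    for x in xs:
        if not smoothed:
            smoothed.append(x)  # initialise with the first observation
        else:
            smoothed.append(alpha * x + (1 - alpha) * smoothed[-1])
    return smoothed


def stages_to_reach(fraction, alpha):
    """Count updates until the response to a constant input of 1 reaches `fraction`.

    Assumption: the smoothed value starts at 0, so after n updates it equals
    1 - (1 - alpha)**n.
    """
    s, n = 0.0, 0
    while s < fraction:
        s = alpha * 1.0 + (1 - alpha) * s
        n += 1
    return n


if __name__ == "__main__":
    alpha = 0.1
    print(stages_to_reach(0.95, alpha), "stages, compared with 3/alpha =", 3 / alpha)
```

Under these assumptions the count for \alpha = 0.1 comes out slightly below 3 / \alpha, which is consistent with the rule being an approximation.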
===Time constant===
The time constant of an exponential moving average is the amount of time for the smoothed response of a unit step function to reach 1-1/e \approx 63.2\,\% of the final signal. The relationship between this time constant, \tau, and the smoothing factor, \alpha, is given by the following formula:
:\alpha = 1 - e^{-\Delta T/\tau}, thus \tau = - \frac{\Delta T}{\ln(1 - \alpha)}
where \Delta T is the sampling time interval of the discrete-time implementation. If the sampling time is fast compared to the time constant (\Delta T \ll \tau) then, by using the Taylor expansion of the exponential function,
:\alpha \approx \frac{\Delta T}{\tau}, thus \tau \approx \frac{\Delta T}{\alpha}
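The two conversions between \tau and \alpha can be sketched in Python as follows. The function names are hypothetical, and `tau` and `dt` are assumed to be positive and expressed in the same time unit.

```python
import math


def alpha_from_time_constant(tau, dt):
    """alpha = 1 - exp(-dt / tau); expm1 keeps precision when dt << tau."""
    return -math.expm1(-dt / tau)


def time_constant_from_alpha(alpha, dt):
    """tau = -dt / ln(1 - alpha); log1p keeps precision for small alpha."""
    return -dt / math.log1p(-alpha)


if __name__ == "__main__":
    dt, tau = 1.0, 100.0  # sampling interval much shorter than the time constant
    alpha = alpha_from_time_constant(tau, dt)
    print("exact alpha:", alpha, "first-order approximation dt/tau:", dt / tau)
```

When \Delta T \ll \tau, the exact value and the first-order approximation \Delta T / \tau differ only by a term of order (\Delta T / \tau)^2.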
===Choosing the initial smoothed value===
Note that in the definition above, s_0 (the initial output of the exponential smoothing algorithm) is initialized to x_0 (the initial raw data or observation). Because exponential smoothing requires that, at each stage, we have the previous forecast s_{t-1}, it is not obvious how to get the method started. We could assume that the initial forecast is equal to the initial value of demand; however, this approach has a serious drawback. Exponential smoothing puts substantial weight on past observations, so the initial value of demand will have an unreasonably large effect on early forecasts. This problem can be overcome by allowing the process to evolve for a reasonable number of periods (10 or more) and using the average of the demand during those periods as the initial forecast. There are many other ways of setting this initial value, but the smaller the value of \alpha, the more sensitive the forecast will be to the choice of this initial smoothed value s_0.
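One possible reading of the warm-up approach is sketched below in Python. It assumes that the mean of the first `warmup` observations serves as the initial smoothed value and that the recursion is then applied to the remaining observations; the function name and the default of 10 periods are illustrative choices, not part of any standard.

```python
def smooth_with_warmup(xs, alpha, warmup=10):
    """Exponential smoothing initialised with the mean of a warm-up window.

    Assumption: the first `warmup` observations are used only to form the
    initial smoothed value; smoothing starts with the next observation.
    """
    if len(xs) < warmup:
        raise ValueError("need at least `warmup` observations")
    s = sum(xs[:warmup]) / warmup  # initial smoothed value from the warm-up mean
    smoothed = []
    for x in xs[warmup:]:
        s = alpha * x + (1 - alpha) * s
        smoothed.append(s)
    return smoothed


if __name__ == "__main__":
    demand = [12, 9, 11, 10, 13, 8, 10, 11, 9, 12, 14, 15, 13]
    print(smooth_with_warmup(demand, alpha=0.2))
```

After t updates the initial value still carries weight (1 - \alpha)^t, which is why a small \alpha makes the result more sensitive to how the initial value is chosen.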
===Optimization===
For every exponential smoothing method, we also need to choose the values of the smoothing parameters. For simple exponential smoothing, there is only one smoothing parameter (α), but for the methods that follow there is usually more than one. In some cases the smoothing parameters are chosen subjectively: the forecaster specifies their values based on previous experience. However, a more robust and objective way to obtain values of the unknown parameters included in any exponential smoothing method is to estimate them from the observed data.

The unknown parameters and the initial values for any exponential smoothing method can be estimated by minimizing the sum of squared errors (SSE). The errors are specified as e_t=y_t-\hat{y}_{t\mid t-1} for t=1, \ldots,T (the one-step-ahead within-sample forecast errors), where y_t is the variable to be predicted at time t and \hat{y}_{t\mid t-1} is its prediction at time t (based on the previous data or predictions). Hence, we find the values of the unknown parameters and the initial values that minimize
: \text{SSE} = \sum_{t=1}^T (y_t-\hat{y}_{t\mid t-1})^2=\sum_{t=1}^T e_t^2
Unlike the regression case (where we have formulae to directly compute the regression coefficients which minimize the SSE), this involves a non-linear minimization problem, and we need to use an optimization tool to perform this.
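As a rough illustration of the SSE criterion for simple exponential smoothing, the following Python sketch evaluates the one-step-ahead errors and picks \alpha by a plain grid search. Two simplifications are assumed here: the initial forecast is fixed to the first observation rather than estimated, and a grid search stands in for a proper numerical optimizer. The function names are hypothetical.

```python
def sse(ys, alpha, initial):
    """Sum of squared one-step-ahead forecast errors e_t = y_t - yhat_{t|t-1}.

    `initial` is the forecast used for the first observation.
    """
    forecast = initial
    total = 0.0
    for y in ys:
        error = y - forecast
        total += error * error
        # Update the forecast for the next period.
        forecast = alpha * y + (1 - alpha) * forecast
    return total


def fit_alpha(ys, steps=1000):
    """Grid search over alpha in (0, 1] with the initial forecast fixed to ys[0].

    Assumption: a grid is fine for illustration; a real application would
    typically use a numerical optimizer and also estimate the initial value.
    """
    initial = ys[0]
    candidates = [i / steps for i in range(1, steps + 1)]
    return min(candidates, key=lambda a: sse(ys, a, initial))


if __name__ == "__main__":
    series = [3.0, 4.1, 3.8, 4.5, 5.2, 4.9, 5.6, 6.1]
    best = fit_alpha(series)
    print("alpha minimizing SSE on the grid:", best)
```

For a series with a steady upward trend, the minimizing \alpha tends toward 1, because simple exponential smoothing has no trend term and any extra smoothing only increases the lag.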
==="Exponential" naming===
The name
exponential smoothing is attributed to the use of the exponential function as the filter impulse response in the convolution. By direct substitution of the defining equation for simple exponential smoothing back into itself we find that
: \begin{align}
s_t & = \alpha x_t + (1-\alpha)s_{t-1}\\[3pt]
& = \alpha x_t + \alpha (1-\alpha)x_{t-1} + (1 - \alpha)^2 s_{t-2}\\[3pt]
& = \alpha \left[x_t + (1-\alpha)x_{t-1} + (1-\alpha)^2 x_{t-2} + (1-\alpha)^3 x_{t-3} + \cdots + (1-\alpha)^{t-1} x_1 \right] + (1-\alpha)^t x_0.
\end{align}
In other words, as time passes the smoothed statistic s_t becomes the weighted average of a greater and greater number of the past observations x_{t-1},\ldots, x_{t-n},\ldots, and the weights assigned to previous observations are proportional to the terms of the geometric progression
: 1, (1-\alpha), (1-\alpha)^2,\ldots, (1-\alpha)^n,\ldots
A geometric progression is the discrete version of an exponential function, so this is where the name for this smoothing method originated according to statistics lore.
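The equivalence between the recursion and the expanded weighted sum can be checked numerically. The Python sketch below uses hypothetical function names and assumes the initialization s_0 = x_0 used in the expansion above.

```python
def smooth_recursive(xs, alpha):
    """Last smoothed value from s_0 = x_0, s_t = alpha*x_t + (1 - alpha)*s_{t-1}."""
    s = xs[0]
    for x in xs[1:]:
        s = alpha * x + (1 - alpha) * s
    return s


def smooth_expanded(xs, alpha):
    """Same value from the expanded form with geometric weights.

    s_t = alpha * sum_{k=0}^{t-1} (1 - alpha)**k * x_{t-k} + (1 - alpha)**t * x_0
    """
    t = len(xs) - 1
    total = (1 - alpha) ** t * xs[0]  # residual weight on the initial value
    for k in range(t):  # k is the lag of the observation x_{t-k}
        total += alpha * (1 - alpha) ** k * xs[t - k]
    return total


def weights(alpha, t):
    """Weights on x_t, x_{t-1}, ..., x_1 followed by the weight on x_0."""
    return [alpha * (1 - alpha) ** k for k in range(t)] + [(1 - alpha) ** t]


if __name__ == "__main__":
    data = [3, 1, 4, 1, 5, 9, 2, 6]
    print(smooth_recursive(data, 0.3), smooth_expanded(data, 0.3))
    print(sum(weights(0.3, len(data) - 1)))  # the weights sum to 1 up to rounding
```

The two functions agree up to floating-point rounding, and the weights sum to 1, so s_t is a genuine weighted average of x_0, \ldots, x_t.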
===Comparison with moving average===
Exponential smoothing and the moving average share the defect of introducing a lag relative to the input data. While this can be corrected by shifting the result by half the window length for a symmetrical kernel, such as a moving average or Gaussian, this approach is not possible for exponential smoothing, since it is an IIR filter and therefore has an asymmetric kernel and frequency-dependent group delay. This means each constituent frequency is shifted by a different amount, so there is no single number of samples by which the output signal can be shifted to account for the lag.

The two filters also have roughly the same distribution of forecast error when α = 2/(k + 1), where k is the number of past data points considered by the moving average. They differ in that exponential smoothing takes into account all past data, whereas the moving average only takes into account k past data points. Computationally speaking, they also differ in that the moving average requires keeping either the past k data points or the data point at lag k + 1 plus the most recent forecast value, whereas exponential smoothing only needs the most recent forecast value.

In the signal processing literature, the use of non-causal (symmetric) filters is commonplace, and the exponential window function is broadly used in this fashion, but a different terminology is used: exponential smoothing is equivalent to a first-order infinite impulse response (IIR) filter, and the moving average is equivalent to a finite impulse response filter with equal weighting factors.

==Double exponential smoothing (Holt linear)==