For β = 0, the function is linear: f(x) = x/2. For β = 1, the function is the Sigmoid Linear Unit (SiLU). For β = 1.702, the function closely approximates the Gaussian Error Linear Unit (GELU). As β → ∞, the function converges to ReLU. Thus, the swish family smoothly interpolates between a linear function and the ReLU function.

Since \operatorname{swish}_\beta(x) = \operatorname{swish}_1(\beta x) / \beta, all members of the swish family have the same shape as the default \operatorname{swish}_1: the graph of \operatorname{swish}_\beta is the graph of \operatorname{swish}_1 rescaled by a factor of 1/\beta in both axes. One usually requires \beta > 0. When \beta is trainable, this constraint can be enforced by setting \beta = e^b, where b is the trainable parameter.
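As an illustration (not part of the standard presentation), a minimal Python sketch of the family and of the \beta = e^b parametrization might look as follows; the function names are illustrative:

<syntaxhighlight lang="python">
import math

def sigmoid(x: float) -> float:
    # Numerically stable logistic sigmoid.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def swish(x: float, beta: float) -> float:
    # swish_beta(x) = x * sigmoid(beta * x)
    return x * sigmoid(beta * x)

def swish_from_b(x: float, b: float) -> float:
    # Trainable-beta parametrization: beta = e^b keeps
    # beta > 0 for any unconstrained real b.
    return swish(x, math.exp(b))

x = 1.5
print(swish(x, 0.0))    # beta = 0: linear, equals x/2 = 0.75
print(swish(x, 1.0))    # beta = 1: SiLU
print(swish(x, 1.702))  # beta = 1.702: approximates GELU(x)
print(swish(x, 1e6))    # large beta: approaches ReLU(x) = 1.5
</syntaxhighlight>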
The Taylor expansion of \operatorname{swish}_1 at 0 is

\operatorname{swish}_1(x) = \frac{x}{2} + \frac{x^2}{4} - \frac{x^4}{48} + \frac{x^6}{480} + O\left(x^8\right)

and the family satisfies the identities

\begin{aligned}
\operatorname{swish}_1(x) &= \frac{x}{2} \tanh\left(\frac{x}{2}\right) + \frac{x}{2} \\
\operatorname{swish}_1(x) + \operatorname{swish}_{-1}(x) &= x \\
\operatorname{swish}_1(x) - \operatorname{swish}_{-1}(x) &= x \tanh\left(\frac{x}{2}\right)
\end{aligned}

== Derivatives ==