A fused multiply–add (FMA or fmadd) is a floating-point multiply–add operation performed in one step (fused operation), with a single rounding. That is, where an unfused multiply–add would compute the product b × c, round it to N significant bits, add the result to a, and round back to N significant bits, a fused multiply–add would compute the entire expression a + (b × c) to its full precision before rounding the final result down to N significant bits.

A fast FMA can speed up and improve the accuracy of many computations that involve the accumulation of products:

• Dot product
• Matrix multiplication
• Polynomial evaluation (e.g., with Horner's rule)
• Newton's method for evaluating functions (from the inverse function)
• Convolution
• Neural networks
• Multiplication in double-double arithmetic

Fused multiply–add can usually be relied on to give more accurate results. However,
William Kahan has pointed out that it can give problems if used unthinkingly. If x² − y² is evaluated as ((x×x) − y×y) (following Kahan's suggested notation, in which redundant parentheses direct the compiler to round the (x×x) term first) using fused multiply–add, then the result may be negative even when x = y, because the first multiplication discards low-significance bits. This could then lead to an error if, for instance, the square root of the result is then evaluated.

When implemented inside a microprocessor, an FMA can be faster than a multiply operation followed by an add. However, standard industrial implementations based on the original IBM RS/6000 design require a 2N-bit adder to compute the sum properly.

Another benefit of including this instruction is that it allows an efficient software implementation of division (see division algorithm) and square root (see methods of computing square roots) operations, thus eliminating the need for dedicated hardware for those operations.
== Dot-product instruction ==

Some machines combine multiple fused multiply–add operations into a single step, e.g. performing a four-element dot product on two 128-bit SIMD registers, a0×b0 + a1×b1 + a2×b2 + a3×b3, with single-cycle throughput.
== Support ==

The FMA operation is included in IEEE 754-2008.

The 1999 standard of the C programming language supports the FMA operation through the fma() standard math library function, and through the automatic transformation of a multiplication followed by an addition (contraction of floating-point expressions), which can be explicitly enabled or disabled with standard pragmas (#pragma STDC FP_CONTRACT). The GCC and Clang C compilers do such transformations by default for processor architectures that support FMA instructions. With GCC, which does not support the aforementioned pragma, this can be globally controlled by the -ffp-contract command-line option.

The fused multiply–add operation was introduced as "multiply–add fused" in the IBM
POWER1 (1990) processor, but has been added to numerous processors:

• IBM POWER1 (1990)
• HP PA-8000 (1996) and above
• Hitachi SuperH SH-4 (1998)
• IBM z/Architecture (since 1998)
• SCE-Toshiba Emotion Engine (1999)
• Intel Itanium (2001)
• STI Cell (2006)
• Fujitsu SPARC64 VI (2007) and above
• (MIPS-compatible) Loongson-2F (2008)
• RISC-V instruction set (2010)
• ARM processors with VFPv4 and/or NEONv2:
  • ARM Cortex-M4F (2010)
  • STM32 Cortex-M33 (VFMA operation)
  • ARM Cortex-A5 (2012)
  • ARM Cortex-A7 (2013)
  • ARM Cortex-A15 (2012)
  • Qualcomm Krait (2012)
  • Apple A6 (2012)
  • All ARMv8 processors
• Fujitsu A64FX, which has "Four-operand FMA with Prefix Instruction"
• x86 processors with the FMA3 and/or FMA4 instruction set:
  • AMD Bulldozer (2011, FMA4 only)
  • AMD Piledriver (2012, FMA3 and FMA4)
  • Intel Haswell (2013, FMA3 only)
  • AMD Steamroller (2014, FMA3 and FMA4)
  • AMD Excavator (2015, FMA3 and FMA4)
  • Intel Skylake (2015, FMA3 only)
  • AMD Zen (2017, FMA3 only)
• Elbrus-8SV (2018)
• GPUs and GPGPU boards:
  • AMD GPUs (2009) and newer:
    • TeraScale 2 "Evergreen"-series based
    • Graphics Core Next-based
  • Nvidia GPUs (2010) and newer:
    • Fermi-based (2010)
    • Kepler-based (2012)
    • Maxwell-based (2014)
    • Pascal-based (2016)
    • Volta-based (2017)
  • Intel GPUs:
    • Integrated GPUs since Sandy Bridge
    • Intel MIC (2012)
  • ARM Mali T600 Series (2012) and above
• Vector processors:
  • NEC SX-Aurora TSUBASA

== Variants with two or more roundings ==