Scatter/gather units were also a part of most vector computers, notably the Cray X-MP and its follow-ons. In this case, the purpose was to make efficient use of the limited space in the vector registers. For instance, the Cray-1 had eight 64-word vector registers, so values that had no effect on the outcome, like zeros in an addition, took up valuable space that could be put to better use. By gathering the non-zero values into the registers and scattering the results back out, the registers could be used much more efficiently, leading to higher performance. However, the Cray-1's vector memory reference instructions could only access memory with a constant stride, which allowed fast access to contiguous data (stride 1) or to data separated by some other constant increment. The gather and scatter instructions introduced in the X-MP eliminated this restriction. This basic layout was widely copied in later supercomputer designs, especially in the variety of models from Japan.

As microprocessor design improved during the 1990s, commodity CPUs began to add vector processing units. At first these tended to be simple, sometimes overlaying the CPU's general-purpose registers, but over time they evolved into increasingly powerful systems that met and then surpassed the units in high-end supercomputers. By this time, scatter/gather instructions had been added to many of these designs.
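
The difference between constant-stride access and gather/scatter access described above can be illustrated with a scalar model. The following Python sketch is only an illustration: the helper names `load_strided`, `gather` and `scatter` are hypothetical and do not correspond to any Cray instruction mnemonics, and plain lists stand in for memory and vector registers.

```python
def load_strided(memory, base, stride, count):
    """Constant-stride load: element i comes from memory[base + i * stride]."""
    return [memory[base + i * stride] for i in range(count)]


def gather(memory, indices):
    """Indexed load: element i comes from memory[indices[i]]."""
    return [memory[j] for j in indices]


def scatter(memory, indices, values):
    """Indexed store: values[i] is written to memory[indices[i]]."""
    for j, v in zip(indices, values):
        memory[j] = v


# A sparse operand: most entries are zero and would waste register slots.
x = [0.0, 3.0, 0.0, 0.0, 5.0, 0.0, 7.0, 0.0]
y = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]

# Index vector holding the positions of the non-zero entries of x.
nonzero = [i for i, v in enumerate(x) if v != 0.0]

# Gather only the useful elements, add the packed vectors,
# then scatter the sums back to their original positions in y.
packed_x = gather(x, nonzero)
packed_y = gather(y, nonzero)
packed_sum = [a + b for a, b in zip(packed_x, packed_y)]
scatter(y, nonzero, packed_sum)

# A constant-stride load can only pick every second element here,
# so it cannot skip the zeros at positions 0 and 2.
every_other = load_strided(x, base=0, stride=2, count=4)
```

In this model the addition touches three packed elements instead of eight, while the strided load still brings two zeros into the "register".
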

x86-64 CPUs that support the AVX2 instruction set can gather 32-bit and 64-bit elements at memory offsets from a base address. A second register acts as a mask that determines whether each element is loaded, and faults caused by invalid memory accesses for masked-out elements are suppressed. The AVX-512 instruction set also contains (potentially masked) scatter operations.

The ARM instruction set's Scalable Vector Extension includes gather and scatter operations on 8-, 16-, 32- and 64-bit elements.
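
The masking behaviour described above for AVX2 gathers and AVX-512 scatters can be modelled in scalar code. The sketch below is a simplified Python model rather than the actual instructions or intrinsics: `masked_gather` and `masked_scatter` are hypothetical names, offsets are counted in elements rather than bytes, the mask is a list of booleans rather than a register, and an out-of-range list index stands in for an invalid memory access. Masked-out lanes are assumed to keep the value from a fallback vector.

```python
def masked_gather(memory, base, offsets, mask, src):
    """Lane i becomes memory[base + offsets[i]] if mask[i] is true,
    otherwise it keeps src[i]."""
    result = list(src)
    for lane, (offset, enabled) in enumerate(zip(offsets, mask)):
        if enabled:
            # Only enabled lanes touch memory, so an invalid offset in a
            # masked-out lane never raises an error (the "fault" is suppressed).
            result[lane] = memory[base + offset]
    return result


def masked_scatter(memory, base, offsets, mask, values):
    """Store values[i] to memory[base + offsets[i]] for enabled lanes only."""
    for offset, enabled, value in zip(offsets, mask, values):
        if enabled:
            memory[base + offset] = value


memory = [10, 11, 12, 13, 14, 15, 16, 17]
offsets = [3, 0, 999, 1]            # 999 lies far outside the buffer
mask = [True, True, False, True]    # lane 2 is masked out
fallback = [-1, -1, -1, -1]

loaded = masked_gather(memory, 2, offsets, mask, fallback)

# Double the loaded values and write them back through the same mask.
masked_scatter(memory, 2, offsets, mask, [v * 2 for v in loaded])
```

Because lane 2 is masked out, its invalid offset is never dereferenced: the gather leaves the fallback value in that lane and the scatter writes nothing for it.
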

InfiniBand has hardware support for gather/scatter. Without instruction-level gather/scatter, efficient implementations may need to be tuned for optimal performance, for example with prefetching; libraries such as OpenMPI may provide such primitives.

== See also ==