==Spatial architecture platforms==
In 1992, "spatial machines" were suggested as an approach to parallel computation. In 2013, a programming standard was proposed for "spatial computing". Computer scientists at ETH Zurich have proposed a "spatial computer" model for energy-efficient parallel computation. AMD describes AMD XDNA as a "spatial dataflow NPU architecture".
ASICs, fully custom hardware accelerator designs, are the most common form in which spatial architectures have been developed. This is mainly because ASICs mesh well with the efficiency design goals of spatial architectures.
FPGAs can be seen as fine-grained and highly flexible spatial architectures, and the same applies to CGRAs. However, neither is limited to the spatial architecture paradigm, as both may, for example, be reconfigured to run almost arbitrary tasks. They should therefore only be considered spatial architectures when configured to operate as one. In fact, several spatial architecture designs have been developed for deployment on FPGAs.
Systolic arrays are a form of spatial architecture, in that they employ a mesh of computing nodes with a programmable interconnect, allowing computations to unfold while data moves in lock-step from node to node. The computational flow graph of systolic arrays naturally aligns with the pfors of spatial architecture mappings.
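This lock-step movement can be illustrated with a minimal Python sketch (purely illustrative, not modelled on any particular product): an output-stationary systolic array computes a matrix product while skewed operands enter at the array's edges and hop one processing element per cycle.

```python
# An N x N output-stationary systolic array computing C = A @ B. Each
# cycle, every PE multiplies the A value arriving from the left by the
# B value arriving from above, accumulates locally, and forwards both
# operands to its right and bottom neighbours in lock-step.
N = 3
A = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
B = [[9, 8, 7], [6, 5, 4], [3, 2, 1]]

acc = [[0] * N for _ in range(N)]     # one accumulator per PE
a_reg = [[0] * N for _ in range(N)]   # operand registers between PEs
b_reg = [[0] * N for _ in range(N)]

def a_in(row, cycle):
    # Row i of A enters the array's left edge skewed by i cycles;
    # zeros pad the stream before and after the real data.
    k = cycle - row
    return A[row][k] if 0 <= k < N else 0

def b_in(col, cycle):
    # Column j of B enters the top edge skewed by j cycles.
    k = cycle - col
    return B[k][col] if 0 <= k < N else 0

for cycle in range(3 * N - 2):        # enough cycles to drain the array
    # Update PEs back-to-front so each one reads its neighbour's value
    # from the *previous* cycle, mimicking simultaneous register updates.
    for i in reversed(range(N)):
        for j in reversed(range(N)):
            a = a_reg[i][j - 1] if j > 0 else a_in(i, cycle)
            b = b_reg[i - 1][j] if i > 0 else b_in(j, cycle)
            acc[i][j] += a * b
            a_reg[i][j], b_reg[i][j] = a, b

assert acc == [[sum(A[i][k] * B[k][j] for k in range(N)) for j in range(N)]
               for i in range(N)]
print(acc)
```

The skew functions reproduce the staggered operand entry that gives systolic arrays their characteristic computational wavefront.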
Asynchronous arrays of simple processors are precursors of spatial architectures, following the MIMD paradigm and targeting digital signal processing workloads.
Dataflow architectures are also forerunners of spatial architectures, as a general-purpose approach to exploiting parallelism across several functional units. They run a program by starting each computation as soon as its data dependencies are satisfied and the required hardware is available. Spatial architectures simplified this concept by targeting specific kernels: rather than driving execution based on data readiness, they use the kernel's data dependencies to statically define the whole architecture's dataflow, prior to execution, through a mapping.
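The difference can be sketched in a few lines of Python (a toy model; the graph, the mapping, and all names are hypothetical): the dataflow variant fires any operation whose inputs are ready at run time, while the spatial variant follows a mapping, fixed before execution, that assigns each operation to a processing element and a step.

```python
# Toy dataflow graph for d = (a + b) * (a - b): node -> (function, inputs).
ops = {
    "s": (lambda a, b: a + b, ("a", "b")),
    "t": (lambda a, b: a - b, ("a", "b")),
    "d": (lambda s, t: s * t, ("s", "t")),
}

def run_dataflow(inputs):
    # Dataflow-style execution: repeatedly fire every operation whose
    # input values are available, with no pre-computed schedule.
    vals = dict(inputs)
    pending = set(ops)
    while pending:
        ready = {n for n in pending
                 if all(src in vals for src in ops[n][1])}
        for n in ready:                      # fired "in parallel"
            fn, srcs = ops[n]
            vals[n] = fn(*(vals[s] for s in srcs))
        pending -= ready
    return vals["d"]

# Spatial-style execution: a mapping, derived offline from the kernel's
# data dependencies, fixes which processing element runs which node at
# which step; execution then needs no runtime readiness checks.
mapping = [
    (0, 0, "s"), (0, 1, "t"),   # step 0: PE0 and PE1 work in parallel
    (1, 0, "d"),                # step 1: PE0 combines both results
]

def run_spatial(inputs):
    vals = dict(inputs)
    for step, _pe, n in sorted(mapping):     # ordered by step
        fn, srcs = ops[n]
        vals[n] = fn(*(vals[s] for s in srcs))
    return vals["d"]

x = {"a": 7, "b": 3}
assert run_dataflow(x) == run_spatial(x) == (7 + 3) * (7 - 3)
```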
==Not spatial architectures==
Digital signal processors are highly specialized processors with custom datapaths to perform many arithmetic operations quickly, concurrently, and repeatedly on a series of data samples. Despite commonalities in target kernels, a single digital signal processor is not a spatial architecture, as it lacks the inherent spatial parallelism of an array of processing elements. Nonetheless, digital signal processors can be found in FPGAs and CGRAs, where they may form part of a larger spatial architecture design instantiated there.
Tensor Cores, present in Nvidia GPUs since the Volta series, do not qualify as spatial architectures either: while they accelerate matrix multiplication, they are hardwired functional units that do not expose spatial features by themselves. Likewise, a streaming multiprocessor containing multiple tensor cores is not a spatial architecture but an instance of SIMT, as its control is shared across several GPU threads.
==Emergent or unconventional spatial architectures==
In-memory computing proposes to perform computations on data directly inside the memory it is stored in. Its goal is to improve a computation's efficiency and density by sparing costly data transfers and reusing the existing memory hardware. For instance, one operand of a matrix multiplication could be stored in memory while the other is gradually brought in, and the memory itself produces the final product. When each group of memory cells performing a computation between the stored operand and an incoming one, such as a multiplication, is viewed as a processing element, an in-memory-computing-capable memory bank can be seen as a spatial architecture with a predetermined dataflow, the bank's width and height forming the characteristic pfors.
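A toy Python model of this scheme (illustrative only; real in-memory computing substrates operate on analog or bitwise signals rather than software loops) shows the stored operand, the operand streamed in row by row, and the per-column accumulation through which the bank itself emits the product:

```python
# The "stored" operand: matrix W lives in the bank, one row of W per
# row of memory cells.
W = [[1, 2],
     [3, 4],
     [5, 6]]
x = [7, 8, 9]          # the "incoming" operand, streamed element by element

rows, cols = len(W), len(W[0])
y = [0] * cols         # per-column accumulators inside the bank

for r in range(rows):          # stream x[r] to memory row r
    for c in range(cols):      # every cell (r, c) acts as a processing
        y[c] += W[r][c] * x[r] # element: stored value times incoming one

# The bank emits y = x @ W without the matrix ever leaving the memory;
# the two loop bounds correspond to the bank's height and width.
assert y == [sum(x[r] * W[r][c] for r in range(rows)) for c in range(cols)]
print(y)
```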
Cognitive computers developed as part of research on
neuromorphic systems are instances of spatial architectures targeting the acceleration of
spiking neural networks. Each of their processing elements is a core handling several
neurons and their
synapses. It receives
spikes directed to its neurons from other cores,
integrates them, and eventually propagates produced spikes. Cores are connected through a network-on-chip and usually operate
asynchronously. Their mapping consists of assigning neurons to cores while minimizing the total distance traveled by spikes, which acts as a proxy for energy and latency.
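A Python sketch of this mapping objective (hypothetical throughout: the 2×2 core grid, capacities, synapse list, and greedy heuristic are all illustrative, and production mappers use stronger optimization) places neurons on cores so that spikes between connected neurons traverse few network-on-chip hops:

```python
# 2x2 grid of cores, each holding at most CAPACITY neurons; synapses
# are (source, destination) pairs between hypothetical neurons.
CORES = [(x, y) for x in range(2) for y in range(2)]
CAPACITY = 2
synapses = [("n0", "n1"), ("n1", "n2"), ("n2", "n3"),
            ("n0", "n3"), ("n1", "n3")]
neurons = sorted({n for pair in synapses for n in pair})

def hops(c1, c2):
    # Manhattan distance on the network-on-chip, a proxy for the energy
    # and latency of delivering one spike.
    return abs(c1[0] - c2[0]) + abs(c1[1] - c2[1])

def total_spike_distance(placement):
    return sum(hops(placement[a], placement[b]) for a, b in synapses)

# Greedy pass: place each neuron on the non-full core that adds the
# least distance to its already-placed neighbours.
placement = {}
load = {c: 0 for c in CORES}
for n in neurons:
    def cost(core):
        return sum(hops(core, placement[m])
                   for a, b in synapses
                   for m in ((b,) if a == n else (a,) if b == n else ())
                   if m in placement)
    core = min((c for c in CORES if load[c] < CAPACITY), key=cost)
    placement[n] = core
    load[core] += 1

print(placement, "->", total_spike_distance(placement), "hops")
```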
==Specific implementations==
Produced or prototyped spatial architectures as independent accelerators:
• Eyeriss: a deep learning accelerator developed by MIT's CSAIL laboratory, in particular Vivienne Sze's team, and presented in 2016. It employs a 108 KB scratchpad and a 12×14 grid of 168 processing elements, each with a 0.5 KB register file. A successor, Eyeriss v2, has also been designed, implementing a hierarchical interconnect between processing elements to compensate for the lack of bandwidth in the original.
• DianNao: a family of deep learning accelerators developed at ICT, offering both edge-oriented and high-performance-computing-oriented variants. Their base architecture uses reconfigurable arrays of multipliers, adders, and activation-specific functional units to parallelize most deep learning layers.
• Simba: an experimental multi-chip module spatial architecture developed by Nvidia. Each chip has roughly 110 KB of memory and features 16 processing elements, each containing a vector multiply-and-accumulate unit capable of performing a dot product between 8-element vectors. Up to 36 chips have been installed in the same module.
• NVDLA: an open-source, parametric, one-dimensional array of processing elements specialized for convolutions, developed by Nvidia.
• Tensor Processing Unit (TPU): developed by Google and internally deployed in its datacenters since 2015. Its first version employed a large 256×256 systolic array capable of 92 TeraOps/second, paired with a large 28 MB software-managed on-chip memory. Several subsequent versions have been developed with increasing capabilities.
• TrueNorth: a neuromorphic chip produced by IBM in 2014. It features 4096 cores, each capable of handling 256 simulated neurons and 64k synapses. It has no global clock, and its cores operate in an event-driven fashion using both synchronous and asynchronous logic.

Spatial architectures integrated into existing products or platforms:
• Gemmini: a systolic array-based deep learning accelerator developed by UC Berkeley as part of their open-source RISC-V ecosystem. Its base configuration is a 16×16 array with 512 KB of memory, and it is intended to be controlled by a tightly coupled core.
• AI Engine: an accelerator developed by AMD and integrated in their Ryzen AI series of products. Each of its processing elements is a SIMD-capable VLIW core, which increases the flexibility of the spatial architecture and enables it to also exploit task parallelism.

Workloads demonstrated to run on these spatial architectures include:
AlexNet, ResNet, BERT, and scientific computing.

==See also==