==Spatial architecture platforms==
In 1992, "spatial machines" were suggested as an approach to parallel computation. In 2013, a programming standard was proposed for "spatial computing". Computer scientists at ETH Zurich have proposed a "spatial computer" model for energy-efficient parallel computation. AMD describes AMD XDNA as a "spatial dataflow NPU architecture".
ASICs, fully custom hardware accelerator designs, are the most common form in which spatial architectures have been developed. This is mainly because ASICs mesh well with the efficiency design goals of spatial architectures.
FPGAs can be seen as fine-grained and highly flexible spatial architectures, and the same applies to CGRAs. However, neither is limited to the spatial architecture paradigm, as both may, for example, be reconfigured to run almost arbitrary tasks. They should therefore only be considered spatial architectures when configured to operate as one. In fact, several spatial architecture designs have been developed for deployment on FPGAs.
Systolic arrays are a form of spatial architecture, in that they employ a mesh of computing nodes with a programmable interconnect, allowing computations to unfold while data moves in lock-step from node to node. The computational flow graph of systolic arrays naturally aligns with the pfors of spatial architecture mappings.
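This lock-step movement can be illustrated with a minimal Python sketch (purely illustrative, not modelled on any particular product): an output-stationary systolic array computes a matrix product while skewed operands enter at the array's edges and hop one processing element per cycle.

```python
# An N x N output-stationary systolic array computing C = A @ B. Each
# cycle, every PE multiplies the A value arriving from the left by the
# B value arriving from above, accumulates locally, and forwards both
# operands to its right and bottom neighbours in lock-step.
N = 3
A = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
B = [[9, 8, 7], [6, 5, 4], [3, 2, 1]]

acc = [[0] * N for _ in range(N)]     # one accumulator per PE
a_reg = [[0] * N for _ in range(N)]   # operand registers between PEs
b_reg = [[0] * N for _ in range(N)]

def a_in(row, cycle):
    # Row i of A enters the array's left edge skewed by i cycles;
    # zeros pad the stream before and after the real data.
    k = cycle - row
    return A[row][k] if 0 <= k < N else 0

def b_in(col, cycle):
    # Column j of B enters the top edge skewed by j cycles.
    k = cycle - col
    return B[k][col] if 0 <= k < N else 0

for cycle in range(3 * N - 2):        # enough cycles to drain the array
    # Update PEs back-to-front so each one reads its neighbour's value
    # from the *previous* cycle, mimicking simultaneous register updates.
    for i in reversed(range(N)):
        for j in reversed(range(N)):
            a = a_reg[i][j - 1] if j > 0 else a_in(i, cycle)
            b = b_reg[i - 1][j] if i > 0 else b_in(j, cycle)
            acc[i][j] += a * b
            a_reg[i][j], b_reg[i][j] = a, b

assert acc == [[sum(A[i][k] * B[k][j] for k in range(N)) for j in range(N)]
               for i in range(N)]
print(acc)
```

The skew functions reproduce the staggered operand entry that gives systolic arrays their characteristic computational wavefront.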
Asynchronous arrays of simple processors are precursors of spatial architectures, following the MIMD paradigm and targeting digital signal processing workloads.
Dataflow architectures are also forerunners of spatial architectures, as a general-purpose approach to exploiting parallelism across several functional units. They run a program by starting each computation as soon as its data dependencies are satisfied and the required hardware is available. Spatial architectures simplified this concept by targeting specific kernels: rather than driving execution based on data readiness, they use the kernel's data dependencies to statically define the whole architecture's dataflow, prior to execution, through a mapping.
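The difference can be sketched in a few lines of Python (a toy model; the graph, the mapping, and all names are hypothetical): the dataflow variant fires any operation whose inputs are ready at run time, while the spatial variant follows a mapping, fixed before execution, that assigns each operation to a processing element and a step.

```python
# Toy dataflow graph for d = (a + b) * (a - b): node -> (function, inputs).
ops = {
    "s": (lambda a, b: a + b, ("a", "b")),
    "t": (lambda a, b: a - b, ("a", "b")),
    "d": (lambda s, t: s * t, ("s", "t")),
}

def run_dataflow(inputs):
    # Dataflow-style execution: repeatedly fire every operation whose
    # input values are available, with no pre-computed schedule.
    vals = dict(inputs)
    pending = set(ops)
    while pending:
        ready = {n for n in pending
                 if all(src in vals for src in ops[n][1])}
        for n in ready:                      # fired "in parallel"
            fn, srcs = ops[n]
            vals[n] = fn(*(vals[s] for s in srcs))
        pending -= ready
    return vals["d"]

# Spatial-style execution: a mapping, derived offline from the kernel's
# data dependencies, fixes which processing element runs which node at
# which step; execution then needs no runtime readiness checks.
mapping = [
    (0, 0, "s"), (0, 1, "t"),   # step 0: PE0 and PE1 work in parallel
    (1, 0, "d"),                # step 1: PE0 combines both results
]

def run_spatial(inputs):
    vals = dict(inputs)
    for step, _pe, n in sorted(mapping):     # ordered by step
        fn, srcs = ops[n]
        vals[n] = fn(*(vals[s] for s in srcs))
    return vals["d"]

x = {"a": 7, "b": 3}
assert run_dataflow(x) == run_spatial(x) == (7 + 3) * (7 - 3)
```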
==Not spatial architectures==
Digital signal processors are highly specialized processors with custom datapaths to perform many arithmetic operations quickly, concurrently, and repeatedly on a series of data samples. Despite commonalities in target kernels, a single digital signal processor is not a spatial architecture, as it lacks the inherent spatial parallelism of an array of processing elements. Nonetheless, digital signal processors can be found in FPGAs and CGRAs, where they may form part of a larger spatial architecture design instantiated there.
Tensor Cores, present in Nvidia GPUs since the Volta series, do not qualify as spatial architectures either: while they accelerate matrix multiplication, they are hardwired functional units that do not expose spatial features by themselves. Likewise, a streaming multiprocessor containing multiple tensor cores is not a spatial architecture but an instance of SIMT, as its control is shared across several GPU threads.
==Emergent or unconventional spatial architectures==
In-memory computing proposes to perform computations on data directly inside the memory it is stored in. Its goal is to improve a computation's efficiency and density by sparing costly data transfers and reusing the existing memory hardware. For instance, one operand of a matrix multiplication could be stored in memory while the other is gradually brought in, and the memory itself produces the final product. When each group of memory cells performing a computation between the stored operand and an incoming one, such as a multiplication, is viewed as a processing element, an in-memory-computing-capable memory bank can be seen as a spatial architecture with a predetermined dataflow, the bank's width and height forming the characteristic pfors.
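A toy Python model of this scheme (illustrative only; real in-memory computing substrates operate on analog or bitwise signals rather than software loops) shows the stored operand, the operand streamed in row by row, and the per-column accumulation through which the bank itself emits the product:

```python
# The "stored" operand: matrix W lives in the bank, one row of W per
# row of memory cells.
W = [[1, 2],
     [3, 4],
     [5, 6]]
x = [7, 8, 9]          # the "incoming" operand, streamed element by element

rows, cols = len(W), len(W[0])
y = [0] * cols         # per-column accumulators inside the bank

for r in range(rows):          # stream x[r] to memory row r
    for c in range(cols):      # every cell (r, c) acts as a processing
        y[c] += W[r][c] * x[r] # element: stored value times incoming one

# The bank emits y = x @ W without the matrix ever leaving the memory;
# the two loop bounds correspond to the bank's height and width.
assert y == [sum(x[r] * W[r][c] for r in range(rows)) for c in range(cols)]
print(y)
```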
Cognitive computers developed as part of research on
neuromorphic systems are instances of spatial architectures targeting the acceleration of
spiking neural networks. Each of their processing elements is a core handling several
neurons and their
synapses. It receives
spikes directed to its neurons from other cores,
integrates them, and eventually propagates produced spikes. Cores are connected through a network-on-chip and usually operate
asynchronously. Their mapping consists of assigning neurons to cores while minimizing the total distance traveled by spikes, which acts as a proxy for energy and latency.
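A Python sketch of this mapping objective (hypothetical throughout: the 2×2 core grid, capacities, synapse list, and greedy heuristic are all illustrative, and production mappers use stronger optimization) places neurons on cores so that spikes between connected neurons traverse few network-on-chip hops:

```python
# 2x2 grid of cores, each holding at most CAPACITY neurons; synapses
# are (source, destination) pairs between hypothetical neurons.
CORES = [(x, y) for x in range(2) for y in range(2)]
CAPACITY = 2
synapses = [("n0", "n1"), ("n1", "n2"), ("n2", "n3"),
            ("n0", "n3"), ("n1", "n3")]
neurons = sorted({n for pair in synapses for n in pair})

def hops(c1, c2):
    # Manhattan distance on the network-on-chip, a proxy for the energy
    # and latency of delivering one spike.
    return abs(c1[0] - c2[0]) + abs(c1[1] - c2[1])

def total_spike_distance(placement):
    return sum(hops(placement[a], placement[b]) for a, b in synapses)

# Greedy pass: place each neuron on the non-full core that adds the
# least distance to its already-placed neighbours.
placement = {}
load = {c: 0 for c in CORES}
for n in neurons:
    def cost(core):
        return sum(hops(core, placement[m])
                   for a, b in synapses
                   for m in ((b,) if a == n else (a,) if b == n else ())
                   if m in placement)
    core = min((c for c in CORES if load[c] < CAPACITY), key=cost)
    placement[n] = core
    load[core] += 1

print(placement, "->", total_spike_distance(placement), "hops")
```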
==Specific implementations==
Produced or prototyped spatial architectures as independent accelerators:
• Eyeriss: a deep learning accelerator developed by MIT's CSAIL laboratory, in particular Vivienne Sze's team, and presented in 2016. It employs a 108 KB scratchpad and a 12×14 grid of 168 processing elements, each with a 0.5 KB register file. A successor, Eyeriss v2, has also been designed, implementing a hierarchical interconnect between processing elements to compensate for the lack of bandwidth in the original.
• DianNao: a family of deep learning accelerators developed at ICT, offering both edge-oriented and high-performance-computing-oriented variants. Their base architecture uses reconfigurable arrays of multipliers, adders, and activation-specific functional units to parallelize most deep learning layers.
• Simba: an experimental multi-chip module spatial architecture developed by Nvidia. Each chip has roughly 110 KB of memory and features 16 processing elements, each containing a vector multiply-and-accumulate unit capable of performing a dot product between 8-element vectors. Up to 36 chips have been installed in the same module.
• NVDLA: an open-source, parametric, one-dimensional array of processing elements specialized for convolutions, developed by Nvidia.
• Tensor Processing Unit (TPU): developed by Google and internally deployed in its datacenters since 2015. Its first version employed a large 256×256 systolic array capable of 92 TeraOps/second, paired with a large 28 MB software-managed on-chip memory. Several subsequent versions have been developed with increasing capabilities.
• TrueNorth: a neuromorphic chip produced by IBM in 2014. It features 4096 cores, each capable of handling 256 simulated neurons and 64k synapses. It has no global clock, and its cores operate in an event-driven fashion using both synchronous and asynchronous logic.

Spatial architectures integrated into existing products or platforms:
• Gemmini: a systolic array-based deep learning accelerator developed by UC Berkeley as part of their open-source RISC-V ecosystem. Its base configuration is a 16×16 array with 512 KB of memory, and it is intended to be controlled by a tightly coupled core.
• AI Engine: an accelerator developed by AMD and integrated in their Ryzen AI series of products. Each of its processing elements is a SIMD-capable VLIW core, which increases the flexibility of the spatial architecture and enables it to also exploit task parallelism.

Workloads demonstrated to run on these spatial architectures include:
AlexNet, ResNet, BERT, and scientific computing.

==See also==