Prefix sum

In computer science, the prefix sum, cumulative sum, inclusive scan, or simply scan of a sequence of numbers x0, x1, x2, ... is a second sequence of numbers y0, y1, y2, ..., the sums of prefixes of the input sequence:y0 = x0 y1 = x0 + x1 y2 = x0 + x1+ x2 ...

Scan higher order function

In functional programming terms, the prefix sum may be generalized to any binary operation (not just the addition operation); the higher order function resulting from this generalization is called a scan, and it is closely related to the fold operation. Both the scan and the fold operations apply the given binary operation to the same sequence of values, but differ in that the scan returns the whole sequence of results from the binary operation, whereas the fold returns only the final result. For instance, the sequence of factorial numbers may be generated by a scan of the natural numbers using multiplication instead of addition: : Inclusive and exclusive scans : Programming language and library implementations of scan may be either inclusive or exclusive. An inclusive scan includes input when computing output (i.e., y_i = \bigoplus_{j=0}^i x_j) while an exclusive scan does not (i.e., y_i = \bigoplus_{j=0}^{i-1} x_j). In the latter case, implementations either leave undefined or accept a separate "" value with which to seed the scan. Either type of scan can be transformed into the other: an inclusive scan can be transformed into an exclusive scan by shifting the array produced by the scan right by one element and inserting the identity value at the left of the array. Conversely, an exclusive scan be transformed into an inclusive scan by shifting the array produced by the scan left and inserting the sum of the last element of the scan and the last element of the input array at the right of the array. The following table lists examples of the inclusive and exclusive scan functions provided by a few programming languages and libraries: The directive-based OpenMP parallel programming model supports both inclusive and exclusive scan support beginning with Version 5.0. ==Parallel algorithms==

Parallel algorithms

There are two key algorithms for computing a prefix sum in parallel. The first offers a shorter span and more parallelism but is not work-efficient. The second is work-efficient but requires double the span and offers less parallelism. These are presented in turn below. Algorithm 1: Shorter span, more parallel Hillis and Steele present the following parallel prefix sum algorithm: for i 2(n) do for j i then x x^i_j means the value of the th element of array in timestep . With a single processor this algorithm would run in time. However, if the machine has at least processors to perform the inner loop in parallel, the algorithm as a whole runs in time, the number of iterations of the outer loop. Algorithm 2: Work-efficient A work-efficient parallel prefix sum can be computed by the following steps. The idea of building in hardware a functional unit dedicated to computing multi-parameter prefix-sum was patented by Uzi Vishkin. a parallel implementation of the C++ standard template library which provides adapted versions for parallel computing of various algorithms. In order to concurrently calculate the prefix sum over data elements with processing elements, the data is divided into p+1 blocks, each containing \frac n {p+1} elements (for simplicity we assume that p+1 divides ). Note, that although the algorithm divides the data into p+1 blocks, only processing elements run in parallel at a time. In a first sweep, each PE calculates a local prefix sum for its block. The last block does not need to be calculated, since these prefix sums are only calculated as offsets to the prefix sums of succeeding blocks and the last block is by definition not succeeded. The offsets which are stored in the last position of each block are accumulated in a prefix sum of their own and stored in their succeeding positions. For being a small number, it is faster to do this sequentially, for a large , this step could be done in parallel as well. A second sweep is performed. This time the first block does not have to be processed, since it does not need to account for the offset of a preceding block. However, in this sweep the last block is included instead and the prefix sums for each block are calculated taking the prefix sum block offsets calculated in the previous sweep into account. function prefix_sum(elements) { n := size(elements) p := number of processing elements prefix_sum := [0...0] of size n do parallel i = 0 to p-1 { // i := index of current PE from j = i * n / (p+1) to (i+1) * n / (p+1) - 1 do { // This only stores the prefix sum of the local blocks store_prefix_sum_with_offset_in(elements, 0, prefix_sum) } } x = 0 for i = 1 to p { // Serial accumulation of total sum of blocks x += prefix_sum[i * n / (p+1) - 1] // Build the prefix sum over the first p blocks prefix_sum[i * n / (p+1)] = x // Save the results to be used as offsets in second sweep } do parallel i = 1 to p { // i := index of current PE from j = i * n / (p+1) to (i+1) * n / (p+1) - 1 do { offset := prefix_sum[i * n / (p+1)] // Calculate the prefix sum taking the sum of preceding blocks as offset store_prefix_sum_with_offset_in(elements, offset, prefix_sum) } } return prefix_sum } Improvement: In case that the number of blocks are too much that makes the serial step time-consuming by deploying a single processor, the Hillis and Steele algorithm can be used to accelerate the second phase. Distributed memory: Hypercube algorithm The Hypercube Prefix Sum Algorithm is well adapted for distributed memory platforms and works with the exchange of messages between the processing elements. It assumes to have p=2^d processor elements (PEs) participating in the algorithm equal to the number of corners in a -dimensional hypercube. Throughout the algorithm, each PE is seen as a corner in a hypothetical hyper cube with knowledge of the total prefix sum as well as the prefix sum of all elements up to itself (according to the ordered indices among the PEs), both in its own hypercube. • The algorithm starts by assuming every PE is the single corner of a zero dimensional hyper cube and therefore and are equal to the local prefix sum of its own elements. • The algorithm goes on by unifying hypercubes which are adjacent along one dimension. During each unification, is exchanged and aggregated between the two hyper cubes which keeps the invariant that all PEs at corners of this new hyper cube store the total prefix sum of this newly unified hyper cube in their variable . However, only the hyper cube containing the PEs with higher index also adds this to their local variable , keeping the invariant that only stores the value of the prefix sum of all elements at PEs with indices smaller or equal to their own index. In a -dimensional hyper cube with 2^d PEs at the corners, the algorithm has to be repeated times to have the 2^dzero-dimensional hyper cubes be unified into one -dimensional hyper cube. Assuming a duplex communication model where the of two adjacent PEs in different hyper cubes can be exchanged in both directions in one communication step, this means d=\log_2 p communication startups. i := Index of own processor element (PE) m := prefix sum of local elements of this PE d := number of dimensions of the hyper cube x = m; // Invariant: The prefix sum up to this PE in the current sub cube σ = m; // Invariant: The prefix sum of all elements in the current sub cube for (k=0; k Large message sizes: pipelined binary tree The Pipelined Binary Tree Algorithm is another algorithm for distributed memory platforms which is specifically well suited for large message sizes. Like the hypercube algorithm, it assumes a special communication structure. The processing elements (PEs) are hypothetically arranged in a binary tree (e.g. a Fibonacci Tree) with infix numeration according to their index within the PEs. Communication on such a tree always occurs between parent and child nodes. The infix numeration ensures that for any given PEj, the indices of all nodes reachable by its left subtree \color{Blue}{[l \dots j-1]} are less than and the indices \color{Blue}{[j+1 \dots r]} of all nodes in the right subtree are greater than . The parent's index is greater than any of the indices in PEj's subtree if PEj is a left child and smaller if PEj is a right child. This allows for the following reasoning: • The local prefix sum \color{Blue}{\oplus[l \dots j-1]} of the left subtree has to be aggregated to calculate PEj's local prefix sum \color{Blue}{\oplus[l \dots j]}. • The local prefix sum \color{Blue}{\oplus[j+1 \dots r]} of the right subtree has to be aggregated to calculate the local prefix sum of higher level PEh which are reached on a path containing a left children connection (which means h > j). • The total prefix sum \color{Red}{\oplus[0 \dots j]} of PEj is necessary to calculate the total prefix sums in the right subtree (e.g. \color{Red}{\oplus[0 \dots j \dots r]} for the highest index node in the subtree). • PEj needs to include the total prefix sum \color{Red}{\oplus[0 \dots l-1]} of the first higher order node which is reached via an upward path including a right children connection to calculate its total prefix sum. Note the distinction between subtree-local and total prefix sums. The points two, three and four can lead to believe they would form a circular dependency, but this is not the case. Lower level PEs might require the total prefix sum of higher level PEs to calculate their total prefix sum, but higher level PEs only require subtree local prefix sums to calculate their total prefix sum. The root node as highest level node only requires the local prefix sum of its left subtree to calculate its own prefix sum. Each PE on the path from PE0 to the root PE only requires the local prefix sum of its left subtree to calculate its own prefix sum, whereas every node on the path from PEp-1 (last PE) to the PEroot requires the total prefix sum of its parent to calculate its own total prefix sum. This leads to a two-phase algorithm: ;Upward Phase:Propagate the subtree local prefix sum \color{Blue}{\oplus[l \dots j \dots r]} to its parent for each PEj. ;Downward phase:Propagate the exclusive (exclusive PEj as well as the PEs in its left subtree) total prefix sum \color{Red}{\oplus[0 \dots l-1]} of all lower index PEs which are not included in the addressed subtree of PEj to lower level PEs in the left child subtree of PEj. Propagate the inclusive prefix sum \color{Red}{\oplus[0 \dots j]} to the right child subtree of PEj. Note that the algorithm is run in parallel at each PE and the PEs will block upon receive until their children/parents provide them with packets. k := number of packets in a message m of a PE m @ {left, right, parent, this} := // Messages at the different PEs x = m @ this // Upward phase - Calculate subtree local prefix sums for j=0 to k-1: // Pipelining: For each packet of a message if hasLeftChild: blocking receive m[j] @ left // This replaces the local m[j] with the received m[j] // Aggregate inclusive local prefix sum from lower index PEs x[j] = m[j] ⨁ x[j] if hasRightChild: blocking receive m[j] @ right // We do not aggregate m[j] into the local prefix sum, since the right children are higher index PEs send x[j] ⨁ m[j] to parent else: send x[j] to parent // Downward phase for j=0 to k-1: m[j] @ this = 0 if hasParent: blocking receive m[j] @ parent // For a left child m[j] is the parents exclusive prefix sum, for a right child the inclusive prefix sum x[j] = m[j] ⨁ x[j] send m[j] to left // The total prefix sum of all PE's smaller than this or any PE in the left subtree send x[j] to right // The total prefix sum of all PE's smaller or equal than this PE Pipelining If the message of length can be divided into packets and the operator ⨁ can be used on each of the corresponding message packets separately, pipelining is possible. If the algorithm is used without pipelining, there are always only two levels (the sending PEs and the receiving PEs) of the binary tree at work while all other PEs are waiting. If there are processing elements and a balanced binary tree is used, the tree has \log _{2}p levels, the length of the path from PE_0 to PE_\mathrm{root} is therefore \log _{2}p - 1 which represents the maximum number of non parallel communication operations during the upward phase, likewise, the communication on the downward path is also limited to \log _{2}p -1 startups. Assuming a communication startup time of T_\mathrm{start} and a bytewise transmission time of T_\mathrm{byte}, upward and downward phase are limited to (2\log _{2}p-2)(T_\mathrm{start} + n\cdot T_\mathrm{byte}) in a non pipelined scenario. Upon division into k packets, each of size \tfrac{n}{k} and sending them separately, the first packet still needs (\log _{2}p-1)\left (T_\mathrm{start} + \frac{n}{k} \cdot T_\mathrm{byte}\right) to be propagated to PE_{\mathrm{root}} as part of a local prefix sum and this will occur again for the last packet if k > \log_{2}p. However, in between, all the PEs along the path can work in parallel and each third communication operation (receive left, receive right, send to parent) sends a packet to the next level, so that one phase can be completed in 2\log_{2}p-1 + 3(k-1) communication operations and both phases together need (4\cdot\log_{2}p-2 + 6(k-1))\left(T_\mathrm{start} + \frac{n}{k} \cdot T_\mathrm{byte}\right) which is favourable for large message sizes . The algorithm can further be optimised by making use of full-duplex or telephone model communication and overlapping the upward and the downward phase. ==Data structures==

Data structures

When a data set may be updated dynamically, it may be stored in a Fenwick tree data structure. This structure allows both the lookup of any individual prefix sum value and the modification of any array value in logarithmic time per operation. However, an earlier 1982 paper presents a data structure called Partial Sums Tree (see Section 5.1) that appears to overlap Fenwick trees; in 1982 the term prefix-sum was not yet as common as it is today. For higher-dimensional arrays, the summed area table provides a data structure based on prefix sums for computing sums of arbitrary rectangular subarrays. This can be a helpful primitive in image convolution operations. ==Applications==

Applications

Counting sort is an integer sorting algorithm that uses the prefix sum of a histogram of key frequencies to calculate the position of each key in the sorted output array. It runs in linear time for integer keys that are smaller than the number of items, and is frequently used as part of radix sort, a fast algorithm for sorting integers that are less restricted in magnitude. In the parallel random access machine model of computing, prefix sums can be used to simulate parallel algorithms that assume the ability for multiple processors to access the same memory cell at the same time, on parallel machines that forbid simultaneous access. By means of a sorting network, a set of parallel memory access requests can be ordered into a sequence such that accesses to the same cell are contiguous within the sequence; scan operations can then be used to determine which of the accesses succeed in writing to their requested cells, and to distribute the results of memory read operations to multiple processors that request the same result. parallel prefix operations form part of the formalization of the data parallelism model provided by machines such as the Connection Machine. The Connection Machine CM-1 and CM-2 provided a hypercubic network on which the Algorithm 1 above could be implemented, whereas the CM-5 provided a dedicated network to implement Algorithm 2. In the construction of Gray codes, sequences of binary values with the property that consecutive sequence values differ from each other in a single bit position, a number can be converted into the Gray code value at position of the sequence simply by taking the exclusive or of and (the number formed by shifting right by a single bit position). The reverse operation, decoding a Gray-coded value into a binary number, is more complicated, but can be expressed as the prefix sum of the bits of , where each summation operation within the prefix sum is performed modulo two. A prefix sum of this type may be performed efficiently using the bitwise Boolean operations available on modern computers, by computing the exclusive or of with each of the numbers formed by shifting to the left by a number of bits that is a power of two. Parallel prefix (using multiplication as the underlying associative operation) can also be used to build fast algorithms for parallel polynomial interpolation. In particular, it can be used to compute the divided difference coefficients of the Newton form of the interpolation polynomial. This prefix based approach can also be used to obtain the generalized divided differences for (confluent) Hermite interpolation as well as for parallel algorithms for Vandermonde systems. The core idea is that, for example, the solutions to the Bayesian/Kalman filtering problems are written in terms of a suitably defined associative filtering operator such that the prefix "sums" of the filtering operator gives the filtering solution. This allows parallel prefix algorithms to be applied to compute the filtering and smoothing solutions. A similar idea also works for the parallelization of a class of probabilistic differential equation solvers in the context of Probabilistic numerics. In the context of Optimal control, parallel prefix algorithms can be used for parallelization of Bellman equation and Hamilton–Jacobi–Bellman equations (HJB equations), including their Linear–quadratic regulator special cases. Here, the idea is that we can define an associative operator for a combination of conditional value functions (conditioned on the end-point), and the prefix sums of this operator give solutions to the Bellman equations or HJB equations. Prefix sum is used for load balancing as a low-cost algorithm to distribute the work between multiple processors, where the overriding goal is achieving an equal amount of work on each processor. The algorithms uses an array of weights representing the amount of work required for each item. After the prefix sum is calculated, the work item is sent for processing to the processor unit with the number . Graphically this corresponds to an operation where the amount of work in each item is represented by the length of a linear segment, all segments are sequentially placed onto a line and the result cut into number of pieces, corresponding to the number of the processors. Below is a lookup table of quarter squares with the remainder discarded for the digits 0 through 18; this allows for the multiplication of numbers up to . If, for example, you wanted to multiply 9 by 3, you observe that the sum and difference are 12 and 6 respectively. Looking both those values up on the table yields 36 and 9, the difference of which is 27, which is the product of 9 and 3. == See also ==

Source: Wikipedia ↗

tickerdossier.com tickerdossier.substack.com