== Prefix sum ==

At the beginning of a prefix sum operation, each processing element i owns a message m_i. The goal is to compute \bigoplus_{0 \le j \le i} m_j, where \oplus is an associative operation. The following pseudocode describes the algorithm.
Input: message m_i of processor i.
Output: prefix sum \bigoplus_{0 \le j \le i} m_j of processor i.

x := m_i
\sigma := m_i
for 0 \le k \le d - 1 do
    y := i \text{ XOR } 2^k
    Send \sigma to y
    Receive m from y
    \sigma := \sigma \oplus m
    if bit k in i is set then x := x \oplus m
endfor

The algorithm works as follows. Observe that hypercubes of dimension d can be split into two hypercubes of dimension d - 1. Refer to the sub cube containing the nodes with a leading 0 as the 0-sub cube and the sub cube consisting of the nodes with a leading 1 as the 1-sub cube. Once both sub cubes have calculated the prefix sum, the sum over all elements in the 0-sub cube has to be added to every element in the 1-sub cube, since every processing element in the 0-sub cube has a lower rank than every processing element in the 1-sub cube. The pseudocode stores the prefix sum in the variable x and the sum over all nodes of a sub cube in the variable \sigma. This makes it possible for all nodes in the 1-sub cube to receive the sum over the 0-sub cube in every step. This results in a factor of \log p for T_\text{start} and a factor of n\log p for T_\text{byte}: T(n,p) = (T_\text{start} + nT_\text{byte})\log p.
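The scheme can be simulated sequentially. The sketch below assumes p = 2^d processing elements, integer messages, and addition as the operator \oplus; these are illustrative choices, not fixed by the algorithm.

```python
# Sequential simulation of the hypercube prefix sum.
# Assumptions: p = 2^d processors, integer messages, and integer
# addition standing in for the generic associative operation.

def hypercube_prefix_sum(m):
    p = len(m)
    d = p.bit_length() - 1
    assert p == 1 << d, "p must be a power of two"
    x = list(m)       # prefix sum held by each processor
    sigma = list(m)   # sum over each processor's current sub cube
    for k in range(d):
        # Every processor exchanges sigma with its neighbour i XOR 2^k;
        # the snapshot models all sends happening simultaneously.
        received = [sigma[i ^ (1 << k)] for i in range(p)]
        for i in range(p):
            if i & (1 << k):          # i lies in the 1-sub cube of step k,
                x[i] += received[i]   # so add the 0-sub cube's total
            sigma[i] += received[i]   # sigma becomes the merged cube's sum
    return x

print(hypercube_prefix_sum([3, 1, 4, 1, 5, 9, 2, 6]))
# [3, 4, 8, 9, 14, 23, 25, 31]
```

After the d iterations, each x[i] holds the prefix sum over ranks 0..i, and each \sigma holds the total over all p elements.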
== All-gather / all-reduce ==

All-gather operations start with each processing element having a message m_i. The goal of the operation is for each processing element to obtain the messages of all other processing elements, i.e. x := m_0 \cdot m_1 \dotsm m_{p-1}, where \cdot is concatenation. The operation can be implemented following the algorithm template.
Input: message x := m_i at processing unit i.
Output: all messages m_0 \cdot m_1 \dotsm m_{p-1}.

x := m_i
for 0 \le k \le d - 1 do
    y := i \text{ XOR } 2^k
    Send x to y
    Receive x' from y
    x := x \cdot x'
endfor

With each iteration, the transferred message doubles in length. This leads to a runtime of T(n,p) \approx \sum_{j=0}^{d-1}(T_\text{start} + n \cdot 2^jT_\text{byte}) = \log(p)\, T_\text{start} + (p-1)nT_\text{byte}. The same principle can be applied to the
all-reduce operation, but instead of concatenating the messages, it performs a reduction operation on the two messages. It is thus a reduce operation whose result is known by all processing units. Compared to a normal reduce operation followed by a broadcast, all-reduce in hypercubes reduces the number of communication steps.
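Both variants can be sketched under the same illustrative assumptions (p = 2^d processing elements, simulated sequentially, with Python lists modelling concatenation and integer addition modelling the reduction):

```python
# Sequential simulation of all-gather and all-reduce on a hypercube.
# Assumptions: p = 2^d processors; lists model concatenation and
# integer addition models the reduction (illustrative choices).

def hypercube_all_gather(m):
    p = len(m)
    d = p.bit_length() - 1
    assert p == 1 << d, "p must be a power of two"
    x = [[v] for v in m]                      # each node starts with its own message
    for k in range(d):
        snapshot = [list(xi) for xi in x]     # all sends happen "simultaneously"
        for i in range(p):
            x[i] = snapshot[i] + snapshot[i ^ (1 << k)]   # x := x . x'
    return x

def hypercube_all_reduce(m):
    p = len(m)
    d = p.bit_length() - 1
    assert p == 1 << d, "p must be a power of two"
    x = list(m)
    for k in range(d):
        snapshot = list(x)
        for i in range(p):
            # Reduction (here: +) instead of concatenation.
            x[i] = snapshot[i] + snapshot[i ^ (1 << k)]
    return x
```

After \log p rounds, every node holds all p messages (all-gather) or the full sum (all-reduce). Note that each node's concatenation order depends on its own rank, matching x := x \cdot x' in the pseudocode; an implementation that needs the messages in rank order would reorder them on receipt.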
== All-to-all ==

Here every processing element has a unique message for each of the other processing elements.
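A sequential sketch of such an exchange (assuming p = 2^d processing elements, and that in each dimension a node forwards exactly those messages whose destination lies in its partner's half of the cube):

```python
# Sequential simulation of hypercube all-to-all.
# Assumptions: p = 2^d processors; msgs[i][j] is the message from i to j.

def hypercube_all_to_all(msgs):
    p = len(msgs)
    d = p.bit_length() - 1
    assert p == 1 << d, "p must be a power of two"
    # buf[i] holds (source, destination, payload) triples currently at node i.
    buf = [[(i, j, msgs[i][j]) for j in range(p)] for i in range(p)]
    for k in reversed(range(d)):          # iterate over dimensions d-1 .. 0
        new = [[] for _ in range(p)]
        for i in range(p):
            partner = i ^ (1 << k)
            for s, t, payload in buf[i]:
                # Forward a message if its destination lies in the partner's
                # k-dimensional sub cube, i.e. bit k of t differs from bit k of i.
                if ((t ^ i) >> k) & 1:
                    new[partner].append((s, t, payload))
                else:
                    new[i].append((s, t, payload))
        buf = new
    return buf   # buf[j] now holds the p messages addressed to node j
```

In every step, each node keeps the half of its current messages whose destination shares bit k with it and forwards the other half, so each node always holds exactly p messages and sends p/2 of them per step.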
Input: message m_{ij} at processing element i to be delivered to processing element j.

for d > k \geq 0 do
    Receive from processing element i \text{ XOR } 2^k: all messages for my k-dimensional sub cube
    Send to processing element i \text{ XOR } 2^k: all messages for its k-dimensional sub cube
endfor

With each iteration, a message comes one dimension closer to its destination, if it has not arrived yet. Hence, all messages have reached their targets after at most d = \log{p} steps. In every step, p / 2 messages are sent: in the first iteration, half of the messages are not meant for the own sub cube; in every following step, the sub cube is only half the size as before, but in the previous step exactly the same number of messages arrived from another processing element. This results in a run-time of T(n,p) \approx \log{p}\, (T_\text{start} + \frac{p}{2}nT_\text{byte}).

== ESBT-broadcast ==