== Distributed memory ==
On distributed memory computer architectures, SPMD implementations usually employ
message passing programming. A distributed memory computer consists of a collection of independent, interconnected computers, called nodes. For parallel execution, each node starts its own copy of the program and communicates with other nodes by sending and receiving messages, calling send/receive routines for that purpose. Other parallelization constructs, such as barrier synchronization, may also be implemented by messages. The messages can be sent by a number of communication mechanisms, such as
TCP/IP over
Ethernet, or specialized high-speed interconnects such as
InfiniBand or
Omni-Path. In distributed memory environments, serial sections of the program can be executed redundantly on every node, rather than computed on one node and sent to the others, when the duplicated computation costs less than the communication it avoids. Nowadays, the programmer is isolated from the details of message passing by standard interfaces, such as
PVM and
MPI. Distributed memory programming is the style used on parallel supercomputers, from homegrown Beowulf clusters to the largest clusters on the TeraGrid, as well as on the GPU-based supercomputers of today.
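As a concrete illustration, a minimal SPMD program using MPI send/receive routines might look like the sketch below; the per-node value and the summation on node 0 are assumptions made for the example, not taken from any particular code. Every node runs the same executable, and behavior differs only through the process rank.

<syntaxhighlight lang="c">
/* Minimal SPMD sketch with MPI: each process runs this same program and
   exchanges data with explicit send/receive calls. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this process's identity   */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of processes */

    int local = rank + 1;                   /* illustrative per-node value */

    if (rank != 0) {
        /* Every non-root node sends its value to node 0. */
        MPI_Send(&local, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
    } else {
        /* Node 0 receives one value from each other node and sums them. */
        int sum = local, incoming;
        for (int src = 1; src < size; src++) {
            MPI_Recv(&incoming, 1, MPI_INT, src, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            sum += incoming;
        }
        printf("sum of 1..%d = %d\n", size, sum);
    }

    MPI_Finalize();
    return 0;
}
</syntaxhighlight>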
== Shared memory ==
On a
shared memory machine (a computer with several interconnected
CPUs that access the same memory space), the sharing can be implemented in the context of either physically shared memory or logically shared (but physically distributed) memory; in addition to the shared memory, the CPUs in the computer system can also have local (or private) memory. For either of these contexts, synchronization can be enabled with hardware-supported primitives (such as
compare-and-swap or
fetch-and-add). For machines that do not have such hardware support, locks can be used and data can be "exchanged" across processors (or, more generally,
processes or
threads) by depositing the shareable data in a shared memory area. When the hardware does not support shared memory, packing the data as a "message" is often the most efficient way to program (logically) shared memory computers with a large number of processors, where the physical memory is local to each processor and accessing the memory of another processor takes longer. SPMD on a shared memory machine can be implemented by standard processes (heavyweight) or threads (lightweight). Shared memory
multiprocessing (both
symmetric multiprocessing, SMP, and
non-uniform memory access, NUMA) presents the programmer with a common memory space and the ability to parallelize execution. With the (IBM) SPMD model, the cooperating processors (or processes) take different paths through the program, using parallel directives (parallelization and synchronization directives, which can utilize compare-and-swap and fetch-and-add operations on shared memory synchronization variables), and perform operations on data in the shared memory ("shared data"); the processors (or processes) can also access and perform operations on data in their local memory ("private data"). In contrast, with fork-and-join approaches, the program starts executing on one processor and the execution splits into a parallel region, which begins when parallel directives are encountered; in the parallel region, the processors execute a parallel task on different data. A typical example is the parallel DO loop, where different processors work on separate parts of the arrays involved in the loop. At the end of the loop, execution is synchronized (with soft- or hard-barriers), and processors (processes) continue to the next available section of the program to execute. The (IBM) SPMD model has been implemented in the current standard interface for shared memory multiprocessing,
OpenMP, which uses multithreading, usually implemented by lightweight processes, called
threads.
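A sketch of the fork-and-join parallel loop described above, written with OpenMP, might look as follows; the array, its size, and the reduction are illustrative assumptions rather than part of any particular program.

<syntaxhighlight lang="c">
/* Fork-and-join sketch with OpenMP: loop iterations are divided among the
   threads of a team; an implicit barrier at the end of the loop
   synchronizes them before serial execution resumes. */
#include <stdio.h>

#define N 1000000

int main(void)
{
    static double a[N];      /* "shared data" in the common address space */
    double sum = 0.0;

    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++) {
        a[i] = (double)i;    /* each thread works on its own part of a[]  */
        sum += a[i];
    }
    /* Implicit barrier here: all threads join before the print below. */

    printf("sum = %.0f\n", sum);
    return 0;
}
</syntaxhighlight>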
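The hardware primitives mentioned earlier can also be reached from portable code; the following sketch is a hypothetical illustration using C11 atomics and POSIX threads (rather than any particular machine's instructions) to update a shared synchronization variable with fetch-and-add.

<syntaxhighlight lang="c">
/* Sketch of synchronization through a fetch-and-add primitive, expressed
   with C11 atomics; four threads update one shared counter. */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static atomic_int counter;               /* shared synchronization variable */

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < 100000; i++)
        atomic_fetch_add(&counter, 1);   /* typically a single atomic
                                            instruction on the hardware    */
    return NULL;
}

int main(void)
{
    pthread_t t[4];
    for (int i = 0; i < 4; i++)
        pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < 4; i++)
        pthread_join(t[i], NULL);
    printf("counter = %d\n", atomic_load(&counter));   /* prints 400000 */
    return 0;
}
</syntaxhighlight>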
== Combination of levels of parallelism ==
Current computers allow exploiting many parallel modes at the same time for maximum combined effect. A distributed memory program using
MPI may run on a collection of nodes. Each node may be a shared memory computer and execute in parallel on multiple CPUs using OpenMP. Within each CPU, SIMD vector instructions (usually generated automatically by the compiler) and
superscalar instruction execution (usually handled transparently by the CPU itself), such as
pipelining and the use of multiple parallel functional units, are used for maximum single-CPU speed.
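A hypothetical sketch of combining these levels, assuming an MPI library and an OpenMP-capable compiler; the SIMD and superscalar levels are left to the compiler and the CPU, as described above.

<syntaxhighlight lang="c">
/* Hybrid sketch: MPI distributes the program across nodes, OpenMP runs
   threads within each node, and the compiler/CPU exploit SIMD and
   superscalar execution inside each thread. */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided, rank;
    /* Request thread support so OpenMP threads can coexist with MPI. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    #pragma omp parallel
    {
        /* Every MPI rank spawns its own team of OpenMP threads. */
        printf("MPI rank %d, OpenMP thread %d of %d\n",
               rank, omp_get_thread_num(), omp_get_num_threads());
    }

    MPI_Finalize();
    return 0;
}
</syntaxhighlight>

== Implementations ==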