==Status usage==
Upon completion of each ALU operation, the ALU's status output signals are usually stored in external registers to make them available for future ALU operations (e.g., to implement multiple-precision arithmetic) and for controlling conditional branching. The bit registers that store the status output signals are often collectively treated as a single, multi-bit register, which is referred to as the "status register" or "condition code register". Depending on the ALU operation being performed, some status register bits may be changed and others may be left unmodified. For example, in bitwise logical operations such as AND and OR, the carry status bit is typically not modified, as it is not relevant to such operations.

In CPUs, the stored carry-out signal is usually connected to the ALU's carry-in net. This facilitates efficient propagation of carries (which may represent addition carries, subtraction borrows, or shift overflows) when performing multiple-precision operations, as it eliminates the need for software management of carry propagation via conditional branching based on the carry status bit.
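For illustration, the following C sketch (hypothetical structure and names, not any specific processor's flag layout) shows how an 8-bit ALU addition with carry-in might derive the commonly provided status bits: carry, zero, negative, and signed overflow. The returned carry is exactly what a subsequent add-with-carry would consume when extending the addition to wider operands.

```c
#include <stdint.h>
#include <stdbool.h>

/* Illustrative status-register model for an 8-bit ALU; field names are hypothetical. */
typedef struct {
    bool carry;     /* carry out of the most-significant bit       */
    bool zero;      /* result equals zero                          */
    bool negative;  /* most-significant bit of the result is set   */
    bool overflow;  /* signed (two's-complement) overflow occurred */
} Status;

/* 8-bit add with carry-in; returns the result and updates the status bits. */
uint8_t alu_adc(uint8_t a, uint8_t b, bool carry_in, Status *s)
{
    uint16_t wide = (uint16_t)a + (uint16_t)b + (carry_in ? 1 : 0);
    uint8_t result = (uint8_t)wide;

    s->carry    = wide > 0xFF;                          /* carry out of bit 7 */
    s->zero     = (result == 0);
    s->negative = (result & 0x80) != 0;
    /* Overflow: operands share a sign, but the result's sign differs. */
    s->overflow = (~(a ^ b) & (a ^ result) & 0x80) != 0;
    return result;
}
```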
==Operand and result data paths==
The sources of ALU operands and destinations of ALU results depend on the architecture of the encapsulating processor and the operation being performed. Processor architectures vary widely, but in general-purpose CPUs, the ALU typically operates in conjunction with a
register file (array of processor registers) or
accumulator register, which the ALU frequently uses as both a source of operands and a destination for results. To accommodate other operand sources, multiplexers are commonly used to select either the register file or alternative ALU operand sources as required by each machine instruction. For example, the architecture shown to the right employs a register file with two read ports, which allows the values stored in any two registers (or the same register) to be ALU operands. Alternatively, it allows either ALU operand to be sourced from an
immediate operand (a constant value which is directly encoded in the machine instruction) or from memory. The ALU result may be written to any register in the register file or to memory.
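A rough software model of such a datapath may help make the arrangement concrete. The sketch below (all names hypothetical) reads two register-file ports, uses a multiplexer to choose between the second register operand and an immediate value, and writes the ALU result back to a destination register; memory operands and control logic are omitted.

```c
#include <stdint.h>

#define NUM_REGS 16

/* Hypothetical single-instruction datapath: a register file with two read
 * ports, an operand multiplexer, an ALU, and one write-back port. */
typedef struct {
    uint32_t regs[NUM_REGS];
} RegisterFile;

typedef enum { SRC_REGISTER, SRC_IMMEDIATE } OperandSelect;
typedef enum { OP_ADD, OP_AND } Opcode;      /* only two ALU functions shown */

uint32_t alu(Opcode op, uint32_t a, uint32_t b)
{
    switch (op) {
    case OP_ADD: return a + b;
    case OP_AND: return a & b;
    }
    return 0;
}

/* Execute one register/immediate instruction: rd = rs1 <op> (rs2 or imm). */
void execute(RegisterFile *rf, Opcode op, unsigned rd, unsigned rs1,
             unsigned rs2, OperandSelect sel, uint32_t imm)
{
    uint32_t a = rf->regs[rs1];                          /* read port 1     */
    uint32_t b = (sel == SRC_REGISTER) ? rf->regs[rs2]   /* read port 2     */
                                       : imm;            /* immediate mux   */
    rf->regs[rd] = alu(op, a, b);                        /* write-back port */
}
```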
==Multiple-precision arithmetic==
In integer arithmetic computations,
multiple-precision arithmetic is an algorithm that operates on integers which are larger than the ALU word size. To do this, the algorithm treats each integer as an ordered collection of ALU-size fragments, arranged from most-significant (MS) to least-significant (LS) or vice versa. For example, in the case of an 8-bit ALU, the 24-bit integer 0x123456 would be treated as a collection of three 8-bit fragments: 0x12 (MS), 0x34, and 0x56 (LS). Since the size of a fragment exactly matches the ALU word size, the ALU can directly operate on this "piece" of operand.

The algorithm uses the ALU to directly operate on particular operand fragments and thus generate a corresponding fragment (a "partial") of the multiple-precision result. Each partial, when generated, is written to an associated region of storage that has been designated for the multiple-precision result. This process is repeated for all operand fragments, producing a complete collection of partials, which is the result of the multiple-precision operation.

In arithmetic operations (e.g., addition, subtraction), the algorithm starts by invoking an ALU operation on the operands' LS fragments, thereby producing both an LS partial and a carry-out bit. The algorithm writes the partial to designated storage, whereas the processor's state machine typically stores the carry-out bit to an ALU status register. The algorithm then advances to the next fragment of each operand's collection and invokes an ALU operation on these fragments along with the stored carry bit from the previous ALU operation, thus producing another (more significant) partial and a carry-out bit. As before, the carry bit is stored to the status register and the partial is written to designated storage. This process repeats until all operand fragments have been processed, resulting in a complete collection of partials in storage, which constitutes the multiple-precision arithmetic result.

In multiple-precision shift operations, the order of operand fragment processing depends on the shift direction. In left-shift operations, fragments are processed LS first because the LS bit of each partial (which is conveyed via the stored carry bit) must be obtained from the MS bit of the previously left-shifted, less-significant operand fragment. Conversely, fragments are processed MS first in right-shift operations because the MS bit of each partial must be obtained from the LS bit of the previously right-shifted, more-significant operand fragment.

In bitwise logical operations (e.g., logical AND, logical OR), the operand fragments may be processed in any arbitrary order because each partial depends only on the corresponding operand fragments (the stored carry bit from the previous ALU operation is ignored).
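For concreteness, the following C sketch assumes an 8-bit ALU word and operands stored as arrays of 8-bit fragments, least-significant fragment first; the local carry variable stands in for the stored carry status bit. The first function performs a multiple-precision addition, the second a multiple-precision left shift by one bit (processed LS fragment first, as described above).

```c
#include <stdint.h>
#include <stddef.h>

/* Multiple-precision addition on an assumed 8-bit ALU word size.
 * a, b and result are arrays of fragments, least-significant first.
 * Returns the carry out of the most-significant partial. */
uint8_t multi_add(const uint8_t *a, const uint8_t *b,
                  uint8_t *result, size_t fragments)
{
    uint8_t carry = 0;                                  /* stored carry status bit  */
    for (size_t i = 0; i < fragments; i++) {
        uint16_t sum = (uint16_t)a[i] + (uint16_t)b[i] + carry;  /* one ALU op      */
        result[i] = (uint8_t)sum;                       /* write partial to storage */
        carry = (uint8_t)(sum >> 8);                    /* save carry for next step */
    }
    return carry;
}

/* Multiple-precision left shift by one bit: the MS bit shifted out of each
 * fragment becomes the LS bit (via the carry) of the next, more-significant one. */
void multi_shl1(uint8_t *v, size_t fragments)
{
    uint8_t carry = 0;
    for (size_t i = 0; i < fragments; i++) {
        uint8_t shifted_out = (v[i] & 0x80) ? 1 : 0;    /* MS bit leaves fragment   */
        v[i] = (uint8_t)((v[i] << 1) | carry);          /* carry enters as LS bit   */
        carry = shifted_out;
    }
}
```

With the 24-bit example above, an operand such as 0x123456 would be stored as the three-element array {0x56, 0x34, 0x12}.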
==Binary fixed-point addition and subtraction==
Binary
fixed-point values are represented by integers. Consequently, for any particular fixed-point scale factor (or implied radix point position), an ALU can directly add or subtract two fixed-point operands and produce a fixed-point result. This capability is commonly used in both fixed-point and floating-point addition and subtraction. In floating-point addition and subtraction, the
significand of the smaller operand is right-shifted so that its fixed-point scale factor matches that of the larger operand. The ALU then adds or subtracts the aligned significands to produce a result significand. Together with other operand elements, the result significand is normalized and rounded to produce the floating-point result.
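The following C sketch illustrates both ideas under stated assumptions: a Q16.16 fixed-point format (16 fractional bits), in which ordinary integer addition yields a correctly scaled result, and a deliberately simplified significand-alignment step (magnitudes only; signs, guard bits, normalization, and rounding are omitted).

```c
#include <stdint.h>

/* Assumed Q16.16 fixed-point format: value = raw / 2^16. */
typedef int32_t q16_16;

/* Both operands share the same scale factor, so plain integer addition
 * produces a correctly scaled fixed-point sum. */
q16_16 fixed_add(q16_16 a, q16_16 b)
{
    return a + b;
}

/* Simplified alignment step from floating-point addition: right-shift the
 * significand of the operand with the smaller exponent so both share the
 * larger exponent, then add the aligned significands. */
uint32_t align_and_add(uint32_t sig_a, int exp_a,
                       uint32_t sig_b, int exp_b, int *exp_out)
{
    if (exp_a < exp_b) {                        /* ensure operand a is the larger */
        uint32_t ts = sig_a; sig_a = sig_b; sig_b = ts;
        int te = exp_a; exp_a = exp_b; exp_b = te;
    }
    int shift = exp_a - exp_b;
    sig_b = (shift < 32) ? (sig_b >> shift) : 0;    /* align the smaller operand  */
    *exp_out = exp_a;
    return sig_a + sig_b;                    /* result significand, before rounding */
}
```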
==Complex operations==
Although it is possible to design ALUs that can perform complex functions, this is usually impractical due to the resulting increases in circuit complexity, power consumption, propagation delay, cost and size. Consequently, ALUs are typically limited to simple functions that can be executed at very high speeds (i.e., very short propagation delays), with more complex functions being the responsibility of software or external circuitry. For example:
• In simple cases in which a CPU contains a single ALU, the CPU typically implements a complex operation by orchestrating a sequence of ALU operations according to a software algorithm (see the sketch after this list).
• More specialized architectures may use multiple ALUs to accelerate complex operations. In such systems, the ALUs are often pipelined, with intermediate results passing through ALUs arranged like a factory production line. Performance is greatly improved over that of a single ALU because all of the ALUs operate concurrently and software overhead is significantly reduced.
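As an illustration of the single-ALU case, the classic shift-and-add algorithm expresses 8-bit unsigned multiplication entirely as a sequence of simple ALU operations (bit tests, additions, and shifts), which is how a CPU without a hardware multiplier might carry it out under software control; this C sketch is illustrative rather than any particular processor's routine.

```c
#include <stdint.h>

/* Unsigned 8 x 8 -> 16-bit multiplication built from elementary ALU
 * operations: test a bit, add, and shift. */
uint16_t mul_shift_add(uint8_t a, uint8_t b)
{
    uint16_t product = 0;
    uint16_t addend  = a;              /* multiplicand, widened so it can shift left */
    for (int i = 0; i < 8; i++) {
        if (b & 1)                     /* ALU bit test of the multiplier's LSB       */
            product += addend;         /* ALU addition                               */
        addend <<= 1;                  /* ALU left shift                             */
        b >>= 1;                       /* ALU right shift                            */
    }
    return product;
}
```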
==Graphics processing units==
Graphics processing units (GPUs) often contain hundreds or thousands of ALUs which can operate concurrently. Depending on the application and GPU architecture, the ALUs may be used to simultaneously process unrelated data or to operate in parallel on related data. An example of the latter is graphics rendering, in which multiple ALUs perform the same operation in parallel on a group of pixels, with each ALU operating on a pixel within a scene.

==Implementation==