=== ISA ===
In the original IBM PC (and the follow-up PC/XT), there was only one
Intel 8237 DMA controller capable of providing four DMA channels (numbered 0–3). These DMA channels performed 8-bit transfers (as the 8237 was an 8-bit device, ideally matched to the PC's
i8088 CPU/bus architecture), could only address the first (
i8086/8088-standard) megabyte of RAM, and were limited to addressing single 64
kB segments within that space (although the source and destination channels could address different segments). Additionally, the controller could only be used for transfers to, from or between expansion bus I/O devices, as the 8237 could only perform memory-to-memory transfers using channels 0 & 1, of which channel 0 in the PC (& XT) was dedicated to
dynamic memory refresh. This prevented it from being used as a general-purpose "
Blitter", and consequently, block memory moves in the PC, limited by the general PIO speed of the CPU, were very slow. With the
IBM PC/AT, the enhanced
AT bus (more familiarly retronymed as the
Industry Standard Architecture (ISA)) added a second 8237 DMA controller to provide three additional channels (5–7; channel 4 is used as a cascade to the first 8237), channels made all the more necessary by the resource clashes that the XT's additional expandability had already caused over the original PC. The page register was also rewired to drive a full 24-bit address, allowing ISA DMA to access the entire 16 MB memory address space of the 80286 CPU. The second controller was also integrated in a way capable of performing 16-bit transfers when an I/O device is used as the data source and/or destination (as it actually only processes data itself for memory-to-memory transfers, otherwise simply
controlling the data flow between other parts of the 16-bit system, making its own data bus width relatively immaterial), doubling data throughput when the upper three channels are used. For compatibility, the lower four DMA channels were still limited to 8-bit transfers only, and whilst memory-to-memory transfers were now technically possible due to the freeing up of channel 0 from having to handle DRAM refresh, from a practical standpoint they were of limited value because of the controller's consequent low throughput compared to what the CPU could now achieve (i.e., a 16-bit, more optimised
80286 running at a minimum of 6 MHz, vs an 8-bit controller locked at 4.77 MHz). In both cases, the 64 kB
segment boundary issue remained, with individual transfers unable to cross segments (instead "wrapping around" to the start of the same segment) even in 16-bit mode, although this was in practice more a problem of programming complexity than performance as the continued need for DRAM refresh (however handled) to monopolise the bus approximately every 15
μs prevented use of large (and fast, but uninterruptible) block transfers.
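Because an 8237 channel's 16-bit address register wraps within a 64 kB page, real-mode DMA code had to ensure a buffer never straddled a physical 64 kB boundary before handing it to the controller. The following is a minimal sketch of such a check; the helper name and the use of a flat 32-bit physical address are illustrative assumptions, not part of any particular BIOS or driver interface:

<syntaxhighlight lang="c">
#include <stdbool.h>
#include <stdint.h>

/* Returns true if a buffer of 'len' bytes starting at physical address
 * 'phys' fits entirely within one 64 kB DMA page, i.e. an 8237 transfer
 * would not wrap around mid-buffer. (Illustrative helper, not a real API.) */
static bool dma_buffer_ok(uint32_t phys, uint32_t len)
{
    if (len == 0 || len > 0x10000)               /* one transfer moves at most 64 kB */
        return false;
    uint32_t page_start = phys & 0xFFFF0000u;    /* 64 kB page the buffer begins in  */
    uint32_t last_byte  = phys + len - 1;        /* address of the final byte        */
    return (last_byte & 0xFFFF0000u) == page_start;  /* same page => no wrap         */
}
</syntaxhighlight>

Buffers failing such a check were typically replaced by, or copied through, a differently placed buffer.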
Due to their lagging performance (a maximum 8-bit transfer capability of 1.6 MB/s at 5 MHz, with real-world rates lower still in the PC/XT, and little better even for 16-bit transfers in the AT, owing to ISA bus overheads and other interference such as memory refresh interruptions), and the unavailability of any speed grades that would have allowed direct replacements operating faster than the original PC's standard 4.77 MHz clock, these devices have been effectively obsolete since the late 1980s.
==== 80386 and 32-bit systems ====
The advent of the
80386 processor in 1985 and its capacity for 32-bit transfers (although great improvements in the efficiency of address calculation and block memory moves in Intel CPUs after the
80186 meant that PIO transfers even by the 16-bit-bus
286 and
386SX could still easily outstrip the 8237), as well as the development of further evolutions to (
EISA) or replacements for (
MCA,
VLB and
PCI) the "ISA" bus, with their own much higher-performance DMA subsystems (peaking at tens of MB/s for EISA and MCA bus-master transfers, and well over 100 MB/s for typical VLB/PCI implementations), made the original DMA controllers seem more of a performance millstone than a booster. They were supported only to the extent required by built-in legacy PC hardware on later machines. The pieces of legacy hardware that continued to use ISA DMA after 32-bit expansion buses became common were
Sound Blaster cards that needed to maintain full hardware compatibility with the
Sound Blaster standard; and
Super I/O devices on motherboards that often integrated a built-in
floppy disk controller, an
IrDA infrared controller when FIR (fast infrared) mode is selected, and an
IEEE 1284 parallel port controller when ECP mode is selected. In cases where original 8237s or direct compatibles were still used, transfers to or from these devices may still be limited to the first 16 MB of main
RAM regardless of the system's actual address space or amount of installed memory.
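On systems where this 16 MB limit applies, the operating system must give such devices buffers taken from low physical memory. As a hedged illustration (the Linux kernel does reserve a ZONE_DMA covering the ISA-reachable range, though details vary by architecture and kernel version, and the helper name here is invented), a driver might obtain such a buffer like this:

<syntaxhighlight lang="c">
#include <linux/slab.h>
#include <linux/gfp.h>

/* Allocate a buffer that an ISA DMA device can reach: GFP_DMA restricts
 * the allocation to ZONE_DMA, which on x86 covers the first 16 MB of
 * physical memory. */
static void *alloc_isa_dma_buffer(size_t len)
{
        return kmalloc(len, GFP_KERNEL | GFP_DMA);
}
</syntaxhighlight>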
Each DMA channel has a 16-bit address register and a 16-bit count register associated with it. To initiate a data transfer, the device driver sets up the DMA channel's address and count registers together with the direction of the data transfer, read or write. It then instructs the DMA hardware to begin the transfer. When the transfer is complete, the device interrupts the CPU.
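As a concrete sketch of that sequence, the following programs legacy ISA DMA channel 2 using the well-documented 8237/PC-AT port layout (mask, flip-flop, mode, address, page and count registers); the outb helper shown, the choice of channel and the mode value passed in are assumptions for the example rather than part of any specific driver:

<syntaxhighlight lang="c">
#include <stdint.h>

/* x86 port-output helper (GCC inline assembly, freestanding code assumed). */
static inline void outb(uint16_t port, uint8_t val)
{
    __asm__ volatile ("outb %0, %1" : : "a"(val), "Nd"(port));
}

/* Program ISA DMA channel 2 (the classic floppy channel) for a transfer of
 * 'len' bytes at physical address 'phys' (which must not cross a 64 kB page).
 * 'mode' selects the 8237 transfer type, e.g. single-transfer read or write. */
void isa_dma_setup_ch2(uint32_t phys, uint32_t len, uint8_t mode)
{
    uint16_t count = (uint16_t)(len - 1);  /* the 8237 counts len-1: 0 means 1 byte */

    outb(0x0A, 0x04 | 2);                  /* mask (disable) channel 2 while programming */
    outb(0x0C, 0x00);                      /* reset the controller's byte flip-flop      */
    outb(0x0B, mode | 2);                  /* mode register: transfer type + channel 2   */

    outb(0x04, phys & 0xFF);               /* address bits 0-7                           */
    outb(0x04, (phys >> 8) & 0xFF);        /* address bits 8-15                          */
    outb(0x81, (phys >> 16) & 0xFF);       /* page register: address bits 16-23          */

    outb(0x05, count & 0xFF);              /* count bits 0-7                             */
    outb(0x05, (count >> 8) & 0xFF);       /* count bits 8-15                            */

    outb(0x0A, 2);                         /* unmask channel 2: the next DRQ starts it   */
}
</syntaxhighlight>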
Scatter-gather or vectored I/O DMA allows the transfer of data to and from multiple memory areas in a single DMA transaction. It is equivalent to the chaining together of multiple simple DMA requests. The motivation is to offload multiple input/output interrupt and data copy tasks from the CPU.
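The chained requests are usually described to the controller as a linked list of per-segment descriptors in memory. The exact layout is controller-specific; the structure below is a hypothetical but representative sketch, not the format of any particular device:

<syntaxhighlight lang="c">
#include <stdint.h>

/* Hypothetical scatter-gather descriptor: each entry describes one contiguous
 * memory segment, and the controller follows 'next' until it reaches an entry
 * flagged as the last one. */
struct sg_descriptor {
    uint64_t phys_addr;   /* physical address of this segment            */
    uint32_t length;      /* number of bytes to transfer from/to it      */
    uint32_t flags;       /* e.g. bit 0 = last descriptor in the chain   */
    uint64_t next;        /* physical address of the next descriptor     */
};

/* A three-segment transfer is then simply three descriptors chained together;
 * the device is handed only the physical address of the first one and raises
 * a single completion interrupt after the final segment. */
</syntaxhighlight>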
DRQ stands for Data request; DACK for Data acknowledge. These symbols, seen on hardware schematics of computer systems with DMA functionality, represent electronic signaling lines between the DMA controller and the device using the corresponding channel. Each DMA channel has one Request and one Acknowledge line. A device that uses DMA must be configured to use both lines of the assigned DMA channel. 16-bit ISA permitted bus mastering. The standard ISA DMA channel assignments reserved channel 0 for DRAM refresh on the PC/XT, channel 2 for the floppy disk controller and channel 4 for the cascade between the two controllers, with most of the remaining channels available to expansion hardware such as sound cards and ECP parallel ports.
=== PCI ===
A PCI architecture has no central DMA controller, unlike ISA. Instead, a PCI device can request control of the bus ("become the bus master") and request to read from and write to system memory. More precisely, a PCI component requests bus ownership from the PCI bus controller (usually the PCI host bridge or a PCI-to-PCI bridge), which will
arbitrate if several devices request bus ownership simultaneously, since there can only be one bus master at one time. When the component is granted ownership, it will issue normal read and write commands on the PCI bus, which will be claimed by the PCI bus controller. As an example, on an
Intel Core-based PC, the southbridge will forward the transactions to the
memory controller (which is
integrated on the CPU die) using
DMI, which will in turn convert them to DDR operations and send them out on the memory bus. As a result, there are quite a number of steps involved in a PCI DMA transfer; however, that poses little problem, since the PCI device or the PCI bus itself is an order of magnitude slower than the rest of the components (see the list of device bandwidths).
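From the driver's point of view, the bridges and DMI hops are invisible; the driver simply enables bus mastering and hands the device a bus address for its buffer. A minimal sketch using the Linux kernel's PCI and DMA-mapping APIs (the function name, the buffer and the device-specific register write are assumptions for the example):

<syntaxhighlight lang="c">
#include <linux/pci.h>
#include <linux/dma-mapping.h>

/* Sketch: allow a PCI device to act as bus master and map one buffer so the
 * device can DMA into it. 'pdev' is the probed struct pci_dev, and 'buf'/'len'
 * describe a kernel buffer owned by this (hypothetical) driver. */
static int example_map_rx_buffer(struct pci_dev *pdev, void *buf, size_t len,
                                 dma_addr_t *bus_addr)
{
        pci_set_master(pdev);   /* permit the device to initiate PCI read/write cycles */

        *bus_addr = dma_map_single(&pdev->dev, buf, len, DMA_FROM_DEVICE);
        if (dma_mapping_error(&pdev->dev, *bus_addr))
                return -ENOMEM;

        /* The driver would now write *bus_addr into a device-specific register
         * so the hardware knows where to place incoming data. */
        return 0;
}
</syntaxhighlight>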
A modern x86 CPU may use more than 4 GB of memory, either utilizing the native 64-bit mode of
x86-64 CPU, or the
Physical Address Extension (PAE), a 36-bit addressing mode. In such a case, a device using DMA with a 32-bit address bus is unable to address memory above the 4 GB line. The new
Double Address Cycle (DAC) mechanism, if implemented on both the PCI bus and the device itself, enables 64-bit DMA addressing. Otherwise, the operating system would need to work around the problem, either by using costly double buffers (DOS/Windows nomenclature), also known as bounce buffers (FreeBSD/Linux), or by using an IOMMU to provide address translation services if one is present.
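In Linux, the choice between these strategies is largely driven by the DMA mask a driver advertises for its device; the DMA-mapping layer then uses 64-bit addressing, bounce buffers or the IOMMU as appropriate. A hedged sketch (the fallback policy shown is the conventional idiom, not mandated by the API):

<syntaxhighlight lang="c">
#include <linux/dma-mapping.h>
#include <linux/pci.h>

/* Tell the DMA layer how many address bits this device can drive. If the
 * device supports 64-bit (DAC) addressing, try that first; otherwise fall
 * back to a 32-bit mask, which may force bounce buffering above 4 GB. */
static int example_set_dma_mask(struct pci_dev *pdev)
{
        if (!dma_set_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(64)))
                return 0;                       /* 64-bit DMA is usable */

        return dma_set_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(32));
}
</syntaxhighlight>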
==== I/OAT ====
As an example of a DMA engine incorporated in a general-purpose CPU platform, some Intel
Xeon chipsets include a DMA engine called
I/O Acceleration Technology (I/OAT), which can offload memory copying from the main CPU, freeing it to do other work. In 2006, Intel's
Linux kernel developer Andrew Grover performed benchmarks using I/OAT to offload network traffic copies and found no more than 10% improvement in CPU utilization with receiving workloads.
==== DDIO ====
Further performance-oriented enhancements to the DMA mechanism have been introduced in Intel
Xeon E5 processors with their
Data Direct I/O (
DDIO) feature, allowing the DMA "windows" to reside within
CPU caches instead of system RAM. As a result, CPU caches are used as the primary source and destination for I/O, allowing
network interface controllers (NICs) to DMA directly to the last-level cache (L3 cache) of local CPUs and avoid costly fetching of the I/O data from system RAM. Consequently, DDIO reduces the overall I/O processing latency, allows processing of the I/O to be performed entirely in cache, prevents the available RAM bandwidth/latency from becoming a performance bottleneck, and may lower power consumption by allowing RAM to remain longer in a low-powered state.
=== AHB ===
In
systems-on-a-chip and
embedded systems, typical system bus infrastructure is a complex on-chip bus such as
AMBA High-performance Bus. AMBA defines two kinds of AHB components: master and slave. A slave interface is similar to programmed I/O through which the software (running on an embedded CPU, e.g.,
ARM) can write/read I/O registers or (less commonly) local memory blocks inside the device. A master interface can be used by the device to perform DMA transactions to/from system memory without heavily loading the CPU. Therefore, high bandwidth devices such as network controllers that need to transfer huge amounts of data to/from system memory will have two interface adapters to the AHB: a master and a slave interface. This is because on-chip buses like AHB do not support
tri-stating the bus or alternating the direction of any line on the bus. Like PCI, no central DMA controller is required since the DMA is bus-mastering, but an
arbiter is required when multiple masters are present on the system. Internally, a multichannel DMA engine is usually present in the device to perform multiple concurrent
scatter-gather operations as programmed by the software.
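How the software programs such an engine is vendor-specific, but it usually amounts to writing source, destination, length and a start bit into memory-mapped channel registers exposed through the controller's AHB slave interface. The register layout and base address below are hypothetical, for illustration only:

<syntaxhighlight lang="c">
#include <stdint.h>

/* Hypothetical memory-mapped register block for one channel of an on-chip
 * DMA engine, reached through its AHB slave interface. */
struct dma_channel_regs {
    volatile uint32_t src;      /* source address in system memory         */
    volatile uint32_t dst;      /* destination address                     */
    volatile uint32_t length;   /* number of bytes to move                 */
    volatile uint32_t control;  /* bit 0: start; bit 1: enable interrupt   */
    volatile uint32_t status;   /* bit 0: busy;  bit 1: transfer complete  */
};

#define DMA_CH0 ((struct dma_channel_regs *)0x40001000u) /* assumed base address */

/* Start a copy on channel 0 and (for simplicity) poll for completion;
 * a real driver would normally take the completion interrupt instead. */
static void dma_copy(uint32_t dst, uint32_t src, uint32_t len)
{
    DMA_CH0->src     = src;
    DMA_CH0->dst     = dst;
    DMA_CH0->length  = len;
    DMA_CH0->control = 0x1;                    /* kick off the transfer      */
    while (DMA_CH0->status & 0x1)              /* wait while channel is busy */
        ;
}
</syntaxhighlight>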
=== Cell ===
As an example usage of DMA in a multiprocessor system-on-chip, IBM/Sony/Toshiba's
Cell processor incorporates a DMA engine for each of its 9 processing elements, including one Power processor element (PPE) and eight synergistic processor elements (SPEs). Since the SPE's load/store instructions can read/write only its own local memory, an SPE entirely depends on DMAs to transfer data to and from the main memory and local memories of other SPEs. Thus, the DMA acts as a primary means of data transfer among cores inside this
CPU (in contrast to cache-coherent CMP architectures such as Intel's cancelled
general-purpose GPU,
Larrabee). DMA in Cell is fully
cache coherent (note, however, local stores of SPEs operated upon by DMA do not act as globally coherent cache in the
standard sense). In both read ("get") and write ("put"), a DMA command can transfer either a single block area of size up to 16 KB, or a list of 2 to 2048 such blocks. The DMA command is issued by specifying a pair of a local address and a remote address: for example, when an SPE program issues a put DMA command, it specifies an address of its own local memory as the source and a virtual memory address (pointing to either the main memory or the local memory of another SPE) as the target, together with a block size. According to an experiment, the effective peak performance of DMA in Cell (3 GHz, under uniform traffic) reaches 200 GB per second.
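In SPE code such commands are typically issued through the MFC intrinsics of the Cell SDK; the following sketch assumes that interface (spu_mfcio.h) and an arbitrarily chosen tag, so treat it as illustrative rather than canonical:

<syntaxhighlight lang="c">
#include <spu_mfcio.h>

#define TAG 3  /* arbitrary MFC tag group used to wait for completion */

/* Pull 'size' bytes (<= 16 KB, suitably aligned) from effective address 'ea'
 * in main memory into the SPE's local store at 'ls_buf', then block until
 * the transfer has completed. */
void fetch_block(void *ls_buf, unsigned long long ea, unsigned int size)
{
    mfc_get(ls_buf, ea, size, TAG, 0, 0);   /* enqueue the "get" DMA command  */
    mfc_write_tag_mask(1 << TAG);           /* select which tag(s) to wait on */
    mfc_read_tag_status_all();              /* stall until the DMA finishes   */
}
</syntaxhighlight>

== DMA controller chips ==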