MarketMemory management unit
Company Profile

Memory management unit

A memory management unit (MMU), sometimes called paged memory management unit (PMMU), is a computer hardware unit that examines all references to memory, and translates the memory addresses being referenced, known as virtual memory addresses, into physical addresses in main memory.

Types of address translation
Early systems used base and bounds addressing that further developed into segmentation, or used a fixed set of blocks instead of loading them on demand. The difference between these two approaches is the size of the contiguous block of memory; paged systems break up main memory into a series of equal-sized blocks, while segmented systems generally allow for variable sizes. Segmented translation In segmented translation, a memory address contains a segment number and an offset within the segment. Segments are variable-length and may have permissions, such as read, write, and execute, associated with them. A segment is loaded into a contiguous area of physical memory. Typically, the segment number is used as an index into a segment table; each entry in the segment table holds the address of the area of physical memory, the length of the segment, and other information such as permission flags. This style has the advantage of simplicity; the memory blocks are continuous, and thus only the two values, base and limit, need to be stored for mapping purposes. The disadvantage of this approach is that it leads to an effect known as external fragmentation. This occurs when memory allocations are released but are non-contiguous. In this case, enough memory may be available to handle a request, but this is spread out and cannot be allocated to a single segment. On systems where programs start and stop over time, this can eventually lead to memory being highly fragmented and no large blocks remaining; in this case, segments would need to be moved in memory, and their segment table entries modified to reflect the new physical address, to make a contiguous space large enough for a segment available. Some models of the PDP-11 16-bit minicomputer have a segmented memory management unit with a set of page address registers (PARs) and page description registers (PDRs); this maps a 16-bit virtual address to an 18-bit physical address. The PDP-11/70 expands that to produce a 22-bit physical address. Segmenting was widely used on microcomputer platforms of the 1980s. Among the MMUs that used this concept were the Motorola 68451 In paged translation, the address space (the range of addresses used by the processor) is divided into pages, each having a size which is a power of 2, usually a few kilobytes, but they may be much larger. Programs reference memory using the natural address size of the machine, typically 32 or 64 bits in modern systems. The bottom bits of the address (the offset within a page) are left unchanged. The upper address bits are the virtual page numbers. Most MMUs use an in-memory table of items called a page table, containing one page table entry (PTE) per virtual page, to map virtual page numbers to physical page numbers in main memory. Multi-level page tables are often used to reduce the size of the page table. An associative cache of PTEs is called a translation lookaside buffer (TLB) and is used to avoid the necessity of accessing the main memory every time a virtual address is mapped. a set of registers, or a one-or-more-level array of static RAM to store a set of mapping information. The MMU splits the virtual address into a virtual page number and an offset within the page. The virtual page number is used to select a page table entry; if the page table entry is found, and the page is marked as being in memory, the physical page number in the page table entry is combined with the offset to construct the physical address corresponding to the virtual address. Sometimes, a page table entry or other per-page information prohibits access to a particular virtual page, perhaps because no physical random-access memory (RAM) has been allocated to that virtual page. In this case, the MMU signals a page fault to the CPU. The operating system (OS) then handles the situation, perhaps by trying to find a spare frame of RAM and set up the page map to map it to the requested virtual address. If no RAM is free, it may be necessary to choose an existing page (known as a victim), using some replacement algorithm, and save it to disk (a process called paging). With some MMUs, there can also be a shortage of PTEs, in which case the OS will have to free one for the new mapping. Both Signetics and Philips produced a version of the 68000 that combined the 68905 on the same physical chip as the processor, the 68070. == Benefits ==
Benefits
An MMU can provide memory protection, so that attempts by code to access or modify data to which it does not have access result in a failure; this catches both software bugs and deliberate attempts to access or modify that data. Typically, an operating system assigns each program its own virtual address space. == Other memory-related MMU functions ==
Other memory-related MMU functions
Some MMUs, such as the Signetics 68905, also included a controller to manage a processor cache, which stores recently accessed data in a very fast memory and thus reduces the need to talk to the slower main memory. In some implementations, MMUs are also responsible for bus arbitration, controlling access to the memory bus among the many parts of the computer that desire access. == IOMMUs ==
IOMMUs
Hardware that performs direct memory access completely bypasses the CPU's MMU. Some I/O systems allow DMA hardware to use physical addresses to read or write directly anywhere in physical RAM without any restrictions. Other I/O systems have a separate MMU called an input-output memory management unit (IOMMU) that can be programmed by the OS to translate device addresses from the DMA hardware to specific, limited physical pages chosen by the OS. The IOMMU can be used to block DMA attacks. ==Examples==
Examples
Most modern systems divide memory into pages that are in size, often with the capability to use so-called huge pages of or in size (often both variants are possible). Page translations are cached in a translation lookaside buffer (TLB). Some systems, mainly older RISC designs, trap into the OS when a page translation is not found in the TLB. Most systems use a hardware-based tree walker. Most systems allow the MMU to be disabled, but some disable the MMU when trapping into OS code. IBM System/360 Model 67, IBM System/370, and successors The IBM System/360 Model 67, which was introduced August, 1965, included an MMU called a dynamic address translation (DAT) box. It has the unusual feature of storing accessed and dirty bits outside of the page table (along with the four bit protection key for all S/360 processors). They refer to physical memory rather than virtual memory, and are accessed by special-purpose instructions. which is very small. An OS may treat multiple pages as if they were a single larger page. For example, Linux on VAX groups eight pages together. Thus, the system is viewed as having pages. The VAX divides memory into four fixed-purpose regions, each in size. They are: DEC Alpha DEC Alpha processors divide memory into , , , or ; the page size is dependent on the processor. pages. After a TLB miss, low-level firmware machine code (here called PALcode) walks a page table. The OpenVMS AXP PALcode and DEC OSF/1 PALcode walk a three-level tree-structured page table. Addresses are broken down into an unused set of bits (containing the same value as the uppermost bit of the index into the root level of the tree), a set of bits to index the root level of the tree, a set of bits to index the middle level of the tree, a set of bits to index the leaf level of the tree, and remaining bits that pass through to the physical address without modification, indexing a byte within the page. The sizes of the fields are dependent on the page size; all three tree index fields are the same size. The Windows NT AXP PALcode supports a page being accessible only from kernel mode or being accessible from user and kernel mode, and also supports a fault on write bit. The Sun-2 workstations are similar; they are built around the Motorola 68010 microprocessor and have a similar memory management unit, with pages and segments. The context register has a 3-bit system context used in supervisor state and a 3-bit user context used in user state. The Sun-3 workstations, except for the Sun-3/80, Sun-3/460, Sun-3/470, and Sun-3/480, are built around the Motorola 68020, and have a similar memory management unit. The page size is increased to . (The later models are built around the Motorola 68030 and use the 68030's on-chip MMU.) The Sun-4 workstations are built around various SPARC microprocessors, and have a memory management unit similar to that of the Sun-3 workstations. PowerPC In PowerPC G1, G2, G3, and G4 pages are normally . After a TLB miss, the standard PowerPC MMU begins two simultaneous lookups. One lookup attempts to match the address with one of four or eight data block address translation (DBAT) registers, or four or eight instruction block address translation registers (IBAT), as appropriate. The BAT registers can map linear chunks of memory as large as , and are normally used by an OS to map large portions of the address space for the OS kernel's own use. If the BAT lookup succeeds, the other lookup is halted and ignored. The other lookup, not directly supported by all processors in this family, is via a so-called inverted page table, which acts as a hashed off-chip extension of the TLB. First, the top four bits of the address are used to select one of 16 segment registers. Then 24 bits from the segment register replace those four bits, producing a 52-bit address. The use of segment registers allows multiple processes to share the same hash table. The 52-bit address is hashed, then used as an index into the off-chip table. There, a group of eight-page table entries is scanned for one that matches. If none match due to excessive hash collisions, the processor tries again with a slightly different hash function. If this, too, fails, the CPU traps into OS (with MMU disabled) so that the problem may be resolved. The OS needs to discard an entry from the hash table to make space for a new entry. The OS may generate the new entry from a more normal tree-like page table or from per-mapping data structures, which are likely to be slower and more space-efficient. Support for no-execute control is in the segment registers, leading to granularity. A major problem with this design is poor cache locality caused by the hash function. Tree-based designs avoid this by placing the page table entries for adjacent pages in adjacent locations. An operating system running on the PowerPC may minimize the size of the hash table to reduce this problem. It is also somewhat slow to remove the page table entries of a process. The OS may avoid reusing segment values to delay facing this, or it may elect to suffer the waste of memory associated with per-process hash tables. G1 chips do not search for page table entries, but they do generate the hash, with the expectation that an OS will search the standard hash table via software. The OS can write to the TLB. G2, G3, and early G4 chips use hardware to search the hash table. The latest chips allow the OS to choose either method. On chips that make this optional or do not support it at all, the OS may choose to use a tree-based page table exclusively. x86 The x86 architecture has evolved over a very long time while maintaining full software compatibility, even for OS code. Thus, the MMU is extremely complex, with many different possible operating modes. The 8086/8088 and 80186/80188 have no memory management unit; they support segmentation, but only to support more physical memory than a 16-bit address can support, as the segment number, in a segment register, is multiplied by 16 and added to the segment offset to generate a physical address. The 80286 added an MMU that supports segmentation, but not paging. When segmentation is enabled by turning on protected mode, the segment number acts as an index into a table of segment descriptors; a segment descriptor contains a base physical address, a segment length, a presence bit to indicate whether the segment is currently in memory, permission bits, and control bits, If the offset in the segment is within the bounds specified by the segment descriptor, that offset is added to the base physical address to generate a physical address. The 80386, which introduced the 32-bit IA-32 version of x86, and subsequent x86 CPUs, support segmentation and paging. If paging is enabled, the base address in a segment descriptor is an address in a linear paged address space divided into pages, so when that is added to the offset in the segment, the resulting address is a linear address in that address space; in IA-32, that address is then masked to be no larger than 32 bits. The result may be looked up via a tree-structured page table, with the bits of the address being split as follows: 10 bits for the branch of the tree, 10 bits for the leaves of the branch, and the 12 lowest bits being directly copied to the result. Segment registers, used in pre-80386 CPUs to extend the address space, are not used in modern OSes, with one major exception: access to thread-specific data for applications or CPU-specific data for OS kernels, which is done with explicit use of the FS and GS segment registers. All memory access involves a segment register, chosen according to the code being executed. Except when using FS or GS, the OS ensures that the offset will be zero. Some operating systems, such as OpenBSD with its W^X feature, and Linux with the Exec Shield or PaX patches, may also limit the length of the code segment, as specified by the CS register, to disallow execution of code in modifiable regions of the address space. Minor revisions of the MMU introduced with the Pentium have allowed very large pages by skipping the bottom level of the tree (this leaves 10 bits for indexing the first level of page hierarchy with the remaining 10+12 bits being directly copied to the result). Minor revisions of the MMU introduced with the Pentium Pro introduced the physical address extension (PAE) feature, enabling 36-bit physical addresses with 2+9+9 bits for three-level page tables and 12 lowest bits being directly copied to the result. Large pages () are also available by skipping the bottom level of the tree (resulting in 2+9 bits for a two-level table hierarchy and the remaining 9+12 lowest bits copied directly). In addition, the page attribute table allowed specification of cacheability by looking up a few high bits in a small on-CPU table. No-execute support was originally only provided on a per-segment basis, making it very awkward to use. More recent x86 chips provide a per-page NX bit (no-execute bit) in the PAE mode. The W^X, Exec Shield, and PaX mechanisms described above emulate per-page non-execute support on machines with x86 processors that lack the NX bit by setting the length of the code segment, with a performance loss and a reduction in the available address space. (HSA) creates a unified virtual address space for CPUs, GPUs and DSPs, obsoleting the mapping tricks and data copying. x86-64, the 64-bit version of the x86 architecture, almost entirely removes segmentation in favor of the flat memory model used by almost all operating systems for the 386 or newer processors. In long mode, all segment offsets are ignored, except for the FS and GS segments; linear addresses are 64-bit rather than 32-bit, with the lowest 48 bits of the address being significant. When used with pages, the page table tree has four levels instead of three, to handle the larger linear addresses; in some newer x86-64 processors, a fifth page table level can be enabled, to support 57-bit linear addresses. In all levels of the page table, the page table entry includes an NX bit. 48-bit linear addresses are divided as follows: 16 bits unused, nine bits each for four tree levels (for a total of 36 bits), and the 12 lowest bits directly copied to the result. With pages, there are only three levels of page table, for a total of 27 bits used in paging and 21 bits of offset. Some newer CPUs also support a page with two levels of paging and of offset. CPUID can be used to determine if pages are supported. In all three cases, the bits are required to be equal to the 48th bit, or in other words, the low are sign extended to the higher bits. This is done to allow further expansion of the addressable range, without compromising backwards compatibility. 57-bit linear addresses are divided as follows: 7 bits unused, nine bits each for five tree levels (for a total of 45 bits), and the 12 lowest bits directly copied to the result. The low are sign extended. ==Alternatives==
Alternatives
Burroughs B5000 series, B6x00/B7x00/B5900/A-series/Unisys MCP systems The Burroughs B5000 from 1961 was the first commercial system to support virtual memory (after the Atlas), has no need for an external MMU. The B5000 and its successors up to current Unisys ClearPath MCP (Libra) systems provide the two functions of an MMU — virtual memory addresses and memory protection — with a different architectural approach. Rather than add virtual memory onto a processor not designed for virtual memory, it is integrated in the core design of the processor/system so there is no requirement for an external unit to add functionality of address translation or bounds checking, resulting in unprecedented memory safety and security. First, in the mapping of virtual memory addresses, instead of needing an MMU, the machines are descriptor-based. The uppermost bit of a 48-bit memory word in the Burroughs B5000 series systems indicate whether the word is a user data or a descriptor/control word; descriptors were read-only to user processes and could only be updated by the system (hardware or operating system). Memory words in the B6x00/B7x00/B5900/A-series/Unisys MCP systems have 48 data bits and 3 tag bits (and later systems to current have 4 tag bits). Words whose tag is an odd number are read-only to user processes – descriptors have a tag of 5 and code words have a tag of 3. Each allocated memory block is given a master descriptor with the properties of the block — physical address, size, and whether it is present in main memory or not, in which case, on first access, the block must be allocated, or if there is an address, it is an address on secondary storage and must be loaded to main memory. All references to code and data are made using a descriptor. When a request is made to access the block for reading or writing, the hardware checks its presence via the presence bit (pbit) in the descriptor. A pbit of 1 indicates the presence of the block. In this case, the block can be accessed via the physical main-memory address in the descriptor. If the pbit is zero, a pbit interrupt is generated for the MCP (operating system) to make the block present. If the address field is zero, this is the first access to this block, and it is allocated (an init (initial) pbit). If the address field is non-zero, it is a disk address of the block, which has previously been rolled out — the block is fetched from disk, the pbit is set to one and the physical memory address updated to point to the block in memory. This makes descriptors equivalent to a page-table entry in an MMU system, but descriptors are free of a table. System performance can be monitored through the number of pbit interrupts, which is the count of accesses to descriptors with a zero bit. Init pbits indicate initial allocations or new memory blocks, but a high level of other pbits indicates that the system had to load a block from virtual memory secondary storage. In normal operation, this will not occur very much. If there are above average number of other pbits occurring, the system might be thrashing, indicating system overload. In production systems, the machine load will be more constant, and thus the system will be configured or tuned to avoid thrashing. All memory allocation is therefore completely automatic (one of the features of modern systems) and there is no way to allocate blocks other than this mechanism. There are no such calls as malloc or dealloc, since memory blocks are also automatically allocated on pbit interrupt or discarded. The scheme is also lazy, since a block will not be allocated until it is actually referenced. When memory is nearly full, the MCP examines the working set, trying compaction (since the system is segmented, not paged), deallocating read-only segments (such as code segments which can be restored from their original copy) and, as a last resort, rolling dirty (that is, updated) data segments out to disk. Another way these systems provide a function of an MMU is in protection. Since all accesses are via the descriptor, the hardware can check that all accesses are within bounds and, in the case of a write, that the process has write permission. The MCP system is inherently secure and thus has no need of an MMU to provide this level of memory protection. Blocks can be shared between processes via copy descriptors in process stacks (a stack is a special system structure representing the execution state of a process). Thus, some processes may have write permission, whereas others do not. Code segments are read-only, thus reentrant and shared between processes. Copy descriptors contain a 20-bit address field giving the index of the master descriptor in the master descriptor array. This also implements a very efficient and secure Interprocess Communication (IPC) mechanism. Blocks can easily be relocated, since only the master descriptor needs to be updated when a block's status changes. Another aspect is performance do MMU-based or non-MMU-based systems provide better performance? MCP systems may be implemented on top of standard hardware that has an MMU (for example, a standard PC). Even if the system implementation uses the MMU in some way, this will not be at all visible at the MCP level. == Bank switching ==
Bank switching
A simple technique called bank switching was widely used by early 8-bit microprocessors like the MOS 6502. For instance, the Atari MMU would express additional bits on the address bus to select among several banks of DRAM memory based on which of the chips was currently active, normally the CPU or ANTIC. This was used to expand the available memory on the Atari 130XE to 128 kB. The Commodore 128 used a similar approach. ==See also==
tickerdossier.comtickerdossier.substack.com