VLIW and EPIC The instruction scheduling logic that makes a superscalar processor work is just
Boolean logic. In the early 1990s, a significant innovation was to realize that the coordination of a multi-ALU computer could be moved into the
compiler, the software that translates a programmer's instructions into machine-level instructions. This type of computer is called a
very long instruction word (VLIW) computer. Scheduling instructions statically in the compiler (versus scheduling dynamically in the processor) can reduce CPU complexity. This can improve performance, and reduce heat and cost. Unfortunately, the compiler lacks accurate knowledge of runtime scheduling issues. Merely changing the CPU core frequency multiplier will have an effect on scheduling. Operation of the program, as determined by input data, will have major effects on scheduling. To overcome these severe problems, a VLIW system may be enhanced by adding the normal dynamic scheduling, losing some of the VLIW advantages. Static scheduling in the compiler also assumes that dynamically generated code will be uncommon. Before the creation of
Java and the
Java virtual machine, this was true. It was reasonable to assume that slow compiles would only affect software developers. Now, with
just-in-time compilation (JIT) virtual machines being used for many languages, slow code generation affects users also. There were several unsuccessful attempts to commercialize VLIW; the basic problem is that a VLIW computer does not scale to different price and performance points as a dynamically scheduled computer can. Another issue is that compiler design for VLIW computers is very difficult, and compilers, as of 2005, often emit suboptimal code for these platforms. Also, VLIW computers optimise for throughput, not low latency, so they were unattractive to engineers designing controllers and other computers embedded in machinery. The
embedded systems markets had often pioneered other computer improvements by providing a large market unconcerned about compatibility with older software. In January 2000,
Transmeta Corporation took the novel step of placing a compiler in the central processing unit, and making the compiler translate from a reference byte code (in their case,
x86 instructions) to an internal VLIW instruction set. This method combines the hardware simplicity, low power and speed of VLIW RISC with the compact main memory system and software backward compatibility provided by popular CISC.
Intel's
Itanium chip is based on what they call an
explicitly parallel instruction computing (EPIC) design. This design supposedly provides the VLIW advantage of increased instruction throughput. However, it avoids some of the issues of scaling and complexity, by explicitly providing in each
bundle of instructions information concerning their dependencies. This information is calculated by the compiler, as it would be in a VLIW design. The early versions are also backward-compatible with existing
x86 software by means of an on-chip
emulator mode. Integer performance was disappointing, and despite improvements, sales in volume markets continued to be low.
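To make the compiler-side scheduling concrete, here is a small sketch in Python (with an invented three-operand instruction format; none of these names come from any real toolchain) that packs a straight-line sequence of operations into fixed-width bundles, starting a new bundle whenever an operation depends on a result produced earlier in the current one. Marking that boundary explicitly in the emitted code, rather than leaving the hardware to rediscover it, is roughly what an EPIC instruction-group boundary expresses.
<syntaxhighlight lang="python">
# Toy static scheduler: pack independent operations into VLIW-style bundles.
# Instruction format (hypothetical): (dest, op, src1, src2), registers named by strings.

BUNDLE_WIDTH = 3  # number of issue slots per bundle (illustrative)

def schedule(instructions):
    """Group instructions into bundles whose members are mutually independent."""
    bundles = [[]]
    for dest, op, src1, src2 in instructions:
        current = bundles[-1]
        written = {d for d, _, _, _ in current}        # results produced in this bundle
        depends = src1 in written or src2 in written   # RAW hazard within the bundle
        clobbers = dest in written                     # WAW hazard within the bundle
        if depends or clobbers or len(current) >= BUNDLE_WIDTH:
            bundles.append([])                         # explicit group boundary ("stop")
            current = bundles[-1]
        current.append((dest, op, src1, src2))
    return bundles

program = [
    ("r1", "add", "r2", "r3"),
    ("r4", "mul", "r5", "r6"),   # independent of the first add -> same bundle
    ("r7", "add", "r1", "r4"),   # uses r1 and r4 -> must start a new bundle
    ("r8", "sub", "r2", "r6"),
]

for i, bundle in enumerate(schedule(program)):
    print(f"bundle {i}: {bundle}")
</syntaxhighlight>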
Multithreading Past designs worked best when the computer was running only one program. However, nearly all modern
operating systems allow running multiple programs together. Switching the CPU over to do work on another program requires costly
context switching. In contrast, multithreaded CPUs can handle instructions from multiple programs at once. To do this, such CPUs include several sets of registers. When a context switch occurs, the contents of the
working registers are simply copied into one of a set of registers kept for this purpose. Such designs often include thousands of registers instead of hundreds as in a typical design. On the downside, registers tend to be somewhat costly in the chip space needed to implement them, space that might otherwise serve some other purpose. Intel calls this technology "Hyper-Threading" and offers two threads per core in its current Core i3, Core i5, Core i7 and Core i9 desktop lineup (as well as in its Core i3, Core i5 and Core i7 mobile lineup), and up to four threads per core in high-end Xeon Phi processors.
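The text above describes copying the working registers into a spare register set; a closely related arrangement, modeled in the following Python sketch with made-up register and thread counts, gives each hardware thread its own register file, so a context switch amounts to updating a selector rather than saving and restoring state through memory.
<syntaxhighlight lang="python">
# Toy model of a multithreaded CPU: one register file per hardware thread.
# Switching threads only changes which file is "active"; nothing is spilled to memory.

NUM_REGS = 8            # working registers per thread (illustrative)
NUM_HW_THREADS = 4      # hardware thread contexts (illustrative)

class MultithreadedCPU:
    def __init__(self):
        # One complete register file per hardware thread.
        self.register_files = [[0] * NUM_REGS for _ in range(NUM_HW_THREADS)]
        self.active_thread = 0

    @property
    def regs(self):
        """The working registers are simply the active thread's register file."""
        return self.register_files[self.active_thread]

    def context_switch(self, thread_id):
        # No save/restore loop: the cost is just updating a selector.
        self.active_thread = thread_id

cpu = MultithreadedCPU()
cpu.regs[0] = 42            # thread 0 computes something
cpu.context_switch(1)       # switch to thread 1: thread 0's state stays in its file
cpu.regs[0] = 7
cpu.context_switch(0)
print(cpu.regs[0])          # 42 -- thread 0's registers were never disturbed
</syntaxhighlight>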
Multi-core Multi-core CPUs are typically multiple CPU cores on the same die, connected to each other via a shared L2 or L3 cache, an on-die
bus, or an on-die
crossbar switch. All the CPU cores on the die share interconnect components with which to interface to other processors and the rest of the system. These components may include a
front-side bus interface, a
memory controller to interface with
dynamic random access memory (DRAM), a
cache coherent link to other processors, and a non-coherent link to the
southbridge and I/O devices. The terms
multi-core and
microprocessor unit (MPU) have come into general use for one die having multiple CPU cores.

The development of multi-core CPUs was largely driven by the physical and thermal limitations of increasing clock speeds. By distributing computational tasks across several cores, systems can achieve higher performance without proportionally increasing power consumption and heat generation. This parallel processing capability allows modern operating systems to schedule multiple threads concurrently, leading to improved responsiveness and throughput, especially in multi-threaded applications. Many modern multi-core processors also incorporate simultaneous multithreading (SMT), a technology that allows each physical core to execute multiple threads concurrently. SMT enhances overall efficiency by making better use of the core's resources during periods of low utilization, thus further improving performance without a significant increase in power draw.

In addition to the shared cache and interconnect components mentioned earlier, advanced interconnect technologies have played a crucial role in boosting multi-core performance. Interfaces such as Intel's QuickPath Interconnect (QPI) and AMD's Infinity Fabric provide high-bandwidth, low-latency communication channels between cores, memory, and other system components. These innovations reduce data transfer bottlenecks and contribute to a more cohesive and efficient processing environment. Moreover, the rise of heterogeneous computing has seen the integration of dedicated accelerators, such as GPUs and specialized co-processors, alongside multi-core CPUs. These systems offload specific tasks, such as graphics rendering or machine learning computations, from the main CPU cores, allowing a more balanced and efficient use of overall system resources. This evolution in processor design continues to influence software development, where parallel and concurrent programming models are increasingly adopted to harness the full potential of multi-core architectures.
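On the software side, here is a minimal sketch of the concurrent programming style this hardware rewards: the standard-library concurrent.futures module spreads an invented, embarrassingly parallel workload (summing squares over ranges, chosen purely for demonstration) across however many cores the machine reports.
<syntaxhighlight lang="python">
# Distribute independent chunks of work across CPU cores.
import os
from concurrent.futures import ProcessPoolExecutor

def sum_of_squares(bounds):
    """CPU-bound work on one chunk; runs in a separate process, i.e. on another core."""
    lo, hi = bounds
    return sum(i * i for i in range(lo, hi))

if __name__ == "__main__":
    chunks = [(i * 1_000_000, (i + 1) * 1_000_000) for i in range(8)]
    # One worker process per core lets the OS schedule the chunks in parallel.
    with ProcessPoolExecutor(max_workers=os.cpu_count()) as pool:
        total = sum(pool.map(sum_of_squares, chunks))
    print(total)
</syntaxhighlight>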
Intelligent RAM One way to work around the
Von Neumann bottleneck is to mix a processor and DRAM all on one chip. Examples include:
• The Berkeley IRAM Project
• eDRAM
• Computational RAM
• Memristor
Reconfigurable logic Another track of development is to combine reconfigurable logic with a general-purpose CPU. In this scheme, a special computer language compiles fast-running subroutines into a bit-mask to configure the logic. Slower or less critical parts of the program can be run by sharing their time on the CPU. This process allows creating devices such as software
radios, by using digital signal processing to perform functions usually performed by analog
electronics.
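As a loose sketch of this partitioning (every name here is invented, and the "fabric" is simulated in ordinary Python rather than emitted as a real configuration bitstream), a hot DSP kernel such as an FIR filter is "configured" once, while the slower control code time-shares the CPU.
<syntaxhighlight lang="python">
# Hypothetical partitioning of a software radio between a CPU and reconfigurable logic.
# "configure_fabric" just returns a Python callable; a real system would instead emit
# a configuration bitstream for the programmable logic.

def configure_fabric(coefficients):
    """Pretend to 'compile' a hot DSP kernel (an FIR filter) into configured logic."""
    taps = list(coefficients)

    def fir(samples):
        out = []
        history = [0.0] * len(taps)
        for s in samples:
            history = [s] + history[:-1]           # shift register, as the fabric would hold it
            out.append(sum(c * h for c, h in zip(taps, history)))
        return out

    return fir

# The fast-running subroutine is "configured" once...
filter_block = configure_fabric([0.25, 0.5, 0.25])

# ...while the slower, less critical control code stays on the CPU.
def receive(samples):
    filtered = filter_block(samples)               # would run in the reconfigurable logic
    return max(filtered)                           # trivial stand-in for further processing

print(receive([0.0, 1.0, 0.0, -1.0, 0.0, 1.0]))
</syntaxhighlight>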
Open source processors As the lines between hardware and software increasingly blur due to progress in design methodology and availability of chips such as
field-programmable gate arrays (FPGA) and cheaper production processes, even
open source hardware has begun to appear. Loosely knit communities like
OpenCores and
RISC-V have announced fully open CPU architectures such as
OpenRISC, which can be readily implemented on FPGAs or in custom-produced chips, by anyone, with no license fees, and even established processor makers like
Sun Microsystems have released processor designs (e.g.,
OpenSPARC) under open-source licenses.
Asynchronous CPUs Yet another option is a
clockless or
asynchronous CPU. Unlike conventional processors, clockless processors have no central clock to coordinate the progress of data through the pipeline. Instead, stages of the CPU are coordinated using logic devices called
pipeline controls or
FIFO sequencers. Basically, the pipeline controller clocks the next stage of logic when the existing stage is complete. Thus, a central clock is unneeded. Relative to clocked logic, it may be easier to implement high-performance devices in asynchronous logic:
• In a clocked CPU, no component can run faster than the clock rate. In a clockless CPU, components can run at different speeds.
• In a clocked CPU, the clock can go no faster than the worst-case performance of the slowest stage. In a clockless CPU, when a stage finishes faster than normal, the next stage can immediately take the results rather than waiting for the next clock tick. A stage might finish faster than normal because of its particular data inputs (e.g., multiplication by 0 or 1 can be very fast), or because it is running at a higher voltage or lower temperature than normal.
Asynchronous logic proponents believe these abilities would have these benefits:
• lower power dissipation for a given performance
• the highest possible execution speeds
The biggest disadvantage of the clockless CPU is that most CPU design tools assume a clocked CPU (a synchronous circuit), so making a clockless CPU (designing an asynchronous circuit) involves modifying the design tools to handle clockless logic and doing extra testing to ensure the design avoids metastability problems.
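As a rough software analogy (not a hardware description; the stage functions and FIFO depths here are invented for illustration), the following Python sketch wires three pipeline stages together with small FIFOs: each stage waits only on its own input and hands each result downstream the moment it is ready, with no shared clock anywhere.
<syntaxhighlight lang="python">
# Software analogy of a clockless pipeline: stages coordinate through FIFOs,
# each one firing as soon as its input arrives and its work is done.
import threading
import queue

DONE = object()  # sentinel marking the end of the data stream

def stage(work, inbox, outbox):
    while True:
        item = inbox.get()              # wait for the previous stage (the "handshake")
        if item is DONE:
            outbox.put(DONE)
            return
        outbox.put(work(item))          # pass the result to the next stage immediately

# Three invented stages with different, data-dependent speeds.
q0, q1, q2, q3 = (queue.Queue(maxsize=1) for _ in range(4))
threads = [
    threading.Thread(target=stage, args=(lambda x: x + 1, q0, q1)),
    threading.Thread(target=stage, args=(lambda x: x * x, q1, q2)),
    threading.Thread(target=stage, args=(lambda x: x - 3, q2, q3)),
]
for t in threads:
    t.start()

for value in [1, 2, 3]:
    q0.put(value)
q0.put(DONE)

while (result := q3.get()) is not DONE:
    print(result)                       # results emerge as soon as each one is ready
for t in threads:
    t.join()
</syntaxhighlight>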
Even so, several asynchronous CPUs have been built, including:
• the ORDVAC and the identical ILLIAC I (1951)
• the ILLIAC II (1962), then the fastest computer on Earth
• the Caltech Asynchronous Microprocessor, the world's first asynchronous microprocessor (1988)
• the ARM-implementing AMULET (1993 and 2000)
• the asynchronous implementation of the MIPS Technologies R3000, named MiniMIPS (1998)
• the SEAforth multi-core processor from Charles H. Moore
Optical communication In theory, an optical computer's components could directly connect through a holographic or phased open-air switching system. This would provide a large increase in effective speed and design flexibility, and a large reduction in cost. Since a computer's connectors are also its most likely failure points, a busless system may be more reliable. Further, as of 2010, modern processors use 64- or 128-bit logic. Superposing many optical wavelengths could allow data lanes and logic densities many orders of magnitude beyond what electronics achieves, with no added space or copper wires.
Optical processors Another long-term option is to use light instead of electricity for digital logic. In theory, this could run about 30% faster and use less power, and allow a direct interface with quantum computing devices. The main problems with this approach are that, for the foreseeable future, electronic computing elements are faster, smaller, cheaper, and more reliable. Such elements are already smaller than some wavelengths of light. Thus, even waveguide-based optical logic may be uneconomic relative to electronic logic. As of 2016, most development effort is for electronic circuitry.
Ionic processors Early experimental work has been done on using ion-based chemical reactions instead of electronic or photonic actions to implement elements of a logic processor.
Belt machine architecture Relative to conventional
register machine or
stack machine architecture, yet similar to Intel's
Itanium architecture, a temporal register addressing scheme has been proposed by Ivan Godard and company that is intended to greatly reduce the complexity of CPU hardware (specifically the number of internal registers and the resulting huge
multiplexer trees). While somewhat harder to read and debug than general-purpose register names, it aids understanding to view the belt as a moving
conveyor belt where the oldest values drop off the belt and vanish. It is implemented in the Mill architecture.
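As a toy illustration of the belt idea (the belt length and operations below are made up, and this is not how the Mill actually encodes instructions), the Python sketch addresses operands purely by how recently they were produced; the oldest values simply fall off the far end.
<syntaxhighlight lang="python">
# Toy belt: operands are addressed by temporal position, not by register name.
from collections import deque

class Belt:
    def __init__(self, length=8):
        self.slots = deque([0] * length, maxlen=length)  # oldest values fall off the far end

    def push(self, value):
        self.slots.appendleft(value)     # every new result lands at position 0

    def __getitem__(self, pos):
        return self.slots[pos]           # belt[0] is the newest result, belt[1] the one before

belt = Belt(length=4)
belt.push(10)                  # some earlier result
belt.push(3)                   # another result; the 10 is now at position 1
belt.push(belt[0] + belt[1])   # "add the two most recent results" -- no register names needed
print(belt[0])                 # 13
</syntaxhighlight>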
== Timeline of events ==