==Original concept==
In 1974, IBM began examining the possibility of constructing a telephone switch to handle one million calls an hour, or about 300 calls per second. They calculated that each call would require 20,000 instructions to complete; at 300 calls per second that is 6 million instructions per second of useful work, and when timing overhead and other considerations were added, such a machine required performance of about 12 MIPS. This would require a significant advance in performance; their current top-of-the-line machine, the IBM System/370 Model 168 of late 1972, offered about 3 MIPS.

The group working on this project at the Thomas J. Watson Research Center, including John Cocke, designed a processor for this purpose. To reach the required performance, they considered the sort of operations such a machine required and removed any that were not appropriate. This led, for instance, to the removal of the floating-point unit, which would not be needed in this application. More critically, they also removed many of the instructions that worked on data in main memory and left only those instructions that worked on the internal processor registers, as these were much faster to use and the simple code in a telephone switch could be written to use only these types of instructions. The result of this work was a conceptual design for a simplified processor with the required performance.

The telephone switch project was canceled in 1975, but the team had made considerable progress on the concept, and in October IBM decided to continue it as a general-purpose design. With no obvious project to attach it to, the team decided to call it the "801" after the building they worked in.

For the general-purpose role, as opposed to the dedicated telephone system, the team began to consider real-world programs that would be run on a typical minicomputer. IBM had collected enormous amounts of statistical data on the performance of real-world workloads on their machines, and this data demonstrated that over half the time in a typical program was spent performing only five instructions: loading a value from memory, storing a value to memory, adding fixed-point numbers, comparing fixed-point numbers, and branching based on the result of those comparisons. This suggested that the same simplified processor design would work just as well for a general-purpose minicomputer as for a special-purpose switch.
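As a rough illustration of that profile (the routine and its names are invented for this sketch, not taken from IBM's data), the C function below, compiled for a load-store machine, needs little beyond those five operations:

<syntaxhighlight lang="c">
/* Invented example: on a load-store machine the loop body compiles to
 * roughly load a[i], load b[i], add, store to out[i], plus an add for
 * i++, a fixed-point compare of i against n, and a conditional branch
 * back to the top of the loop. */
void add_arrays(const int *a, const int *b, int *out, int n)
{
    for (int i = 0; i < n; i++)
        out[i] = a[i] + b[i];
}
</syntaxhighlight>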
==Rationale against use of microcode==
This conclusion flew in the face of contemporary processor design, which was based on the concept of using microcode. IBM had been among the first to make widespread use of this technique as part of their System/360 series. The 360s, and later the 370s, came in a variety of performance levels that all ran the same machine language code. On the high-end machines, many of these instructions were implemented directly in hardware, such as a dedicated floating-point unit, while low-end machines could instead simulate those instructions using sequences of other instructions encoded in microcode. This allowed a single application binary interface to run across the entire lineup and allowed customers to feel confident that, if more performance was ever needed, they could move up to a faster machine without any other changes.

Microcode allowed a simple processor to offer many instructions, which had been used by the designers to implement a wide variety of addressing modes. For instance, an instruction like ADD might have a dozen variations: one that adds two numbers in internal registers, one that adds a register to a value in memory, one that adds two values from memory, and so on. This allowed the programmer to select the exact version they needed for any particular task. The processor would read that instruction and use microcode to break it into a series of internal instructions. For instance, adding two numbers in memory might be implemented by loading those two numbers from memory into registers, adding them, and then storing the sum back to memory. The idea of offering all possible addressing modes for all instructions became a goal of processor designers, the concept becoming known as an orthogonal instruction set.

The 801 team noticed a side-effect of this concept: when faced with the plethora of possible versions of a given instruction, compiler authors would usually pick a single version, typically the one that was implemented in hardware on the low-end machines. That ensured that the machine code generated by the compiler would run as fast as possible on the entire lineup. While using other versions of instructions might run even faster on a machine that implemented them in hardware, the complexity of knowing which one to pick on an ever-changing list of machines made this extremely unattractive, and compiler authors largely ignored these possibilities. As a result, the majority of the instructions available in the instruction set were never used in compiled programs.

And it was here that the team made the key realization of the 801 project: imposing microcode between a computer and its users imposes an expensive overhead in performing the most frequently executed instructions. Microcode takes a non-zero time to examine the instruction before it is performed, and the same underlying processor with the microcode removed would eliminate this overhead and run those instructions faster. Since microcode essentially ran small subroutines dedicated to a particular hardware implementation, it was ultimately performing the same basic task as the compiler, implementing higher-level instructions as a sequence of machine-specific instructions. Simply removing the microcode and moving that job into the compiler could result in a faster machine.
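A minimal sketch of that overhead, written as a C interpreter loop rather than real microcode (the opcode names are invented): every instruction, even the simple register-to-register add, pays the same examine-and-dispatch cost before any useful work happens.

<syntaxhighlight lang="c">
#include <stdint.h>

/* Hypothetical microcoded dispatch, not IBM's: the switch statement is
 * the "examine the instruction" step that every instruction pays for. */
enum opcode { ADD_RR, ADD_RM, HALT };

struct inst { enum opcode op; int dst, src; };

void run(const struct inst *prog, int32_t reg[16], int32_t *mem)
{
    for (;;) {
        const struct inst in = *prog++;   /* fetch the instruction     */
        switch (in.op) {                  /* decode/dispatch overhead  */
        case ADD_RR:                      /* add register to register  */
            reg[in.dst] += reg[in.src];
            break;
        case ADD_RM:                      /* add memory word, a small  */
            reg[in.dst] += mem[in.src];   /* built-in "subroutine"     */
            break;
        case HALT:
            return;
        }
    }
}
</syntaxhighlight>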
One concern was that programs written for such a machine would take up more memory; some tasks that could be accomplished with a single instruction on the 370 would have to be expressed as multiple instructions on the 801. For instance, adding two numbers from memory would require two load-to-register instructions, a register-to-register add, and then a store-to-memory. This could potentially slow the system overall if it had to spend more time reading instructions from memory than it formerly took to decode them. As they continued work on the design and improved their compilers, they found that overall program length continued to fall, eventually becoming roughly the same length as those written for the 370.
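A sketch of that expansion in C, with invented names: the register variables stand in for processor registers, and each statement corresponds to one 801-style instruction replacing a single 370-style memory-to-memory ADD.

<syntaxhighlight lang="c">
/* Hypothetical illustration of the four-instruction sequence the
 * compiler would emit in place of one memory-to-memory ADD. */
long mem_a, mem_b, mem_sum;   /* operands living in main memory */

void add_in_memory(void)
{
    long r1 = mem_a;          /* load first operand to a register  */
    long r2 = mem_b;          /* load second operand to a register */
    long r3 = r1 + r2;        /* register-to-register add          */
    mem_sum = r3;             /* store the sum back to memory      */
}
</syntaxhighlight>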
==First implementations==
The initially proposed architecture was a machine with sixteen 24-bit registers and without virtual memory. It used a two-operand instruction format, so that instructions were generally of the form A = A + B, as opposed to the three-operand format, A = B + C. The resulting CPU was operational by the summer of 1980 and was implemented using Motorola MECL-10K discrete component technology on large wire-wrapped custom boards. The CPU was clocked at 66 ns cycles (approximately 15.15 MHz) and could compute at approximately 15 MIPS.
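The two instruction formats mentioned above differ in whether an operand survives the operation; as a loose C analogy (not 801 assembly, names invented):

<syntaxhighlight lang="c">
/* Loose analogy only: two-operand instructions overwrite one of their
 * sources, while three-operand instructions leave both sources intact. */
int formats(void)
{
    int a = 2, b = 3, c = 4;

    a += b;       /* two-operand style:  A = A + B; the old A is lost  */
    a = b + c;    /* three-operand style: A = B + C; B and C survive   */
    return a;
}
</syntaxhighlight>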
The 801 architecture was used in a variety of IBM devices, including channel controllers for their S/370 mainframes (such as the IBM 3090) and various networking devices, and as a vertical microcode execution unit in the 9373 and 9375 processors of the IBM 9370 mainframe family. The original version of the 801 architecture was the basis for the architecture of the IBM ROMP microprocessor.

==Later modifications==