Instead of batch-mode number crunching, the design would be tailored specifically to interactive use. This would include a built-in graphics engine and 2 GB of RAM, running 4.2BSD Unix. The machine would offer performance on par with contemporary machines from Cray and ETA Systems.
==8×8 crossbar==
The basic idea of Leclerc's system was to use an 8×8 crossbar switch to connect eight custom CMOS CPUs together at high speed. An extra channel on the crossbar allowed it to be connected to another crossbar, forming a single 16-processor unit. The units held 16 processors rather than 8 in order to fully utilize a 16-bank high-speed memory that had been designed along with the rest of the system. Since memory was logically organized on the "far side" of the crossbars, the memory controller handled many of the tasks that would normally be left to the processors, including interrupt handling and virtual memory translation, avoiding a trip through the crossbar for these housekeeping tasks.

The resulting 16-unit processor/memory blocks could then be connected using another 8×8 crossbar, creating a 128-processor machine. Although the delays between the 16-unit blocks would be high, they would not have a large effect on performance if the task could be cleanly separated into units.

When data did have to be shared across the banks, the system balanced the requests: first the "leftmost" processor in the queue would get access, then the "rightmost". Processors added their requests onto the proper end of the queue based on their physical location in the machine. It was felt that the simplicity and speed of this algorithm would make up for forgoing the potential gains of a more complex load-balancing system.
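The alternating left/right arbitration described above can be illustrated with a small Python model. This is only a sketch of one literal reading of the description, not the actual hardware logic: the class name `AlternatingArbiter`, the rule that the lower-numbered half of the processors counts as "left", and the choice to serve the left end first are all assumptions made for illustration.

```python
from collections import deque


class AlternatingArbiter:
    """Toy model of a double-ended request queue served alternately
    from its left and right ends (hypothetical, for illustration only)."""

    def __init__(self, num_processors=16):
        self.num_processors = num_processors
        self.queue = deque()
        # Assumption: the first grant goes to the left end of the queue.
        self.serve_left_next = True

    def request(self, cpu_id):
        """Queue a memory request on the end matching the CPU's position."""
        if not 0 <= cpu_id < self.num_processors:
            raise ValueError("cpu_id out of range")
        # Assumption: the lower-numbered half sits on the "left" side.
        if cpu_id < self.num_processors // 2:
            self.queue.appendleft(cpu_id)
        else:
            self.queue.append(cpu_id)

    def grant(self):
        """Return the next CPU to get access, or None if the queue is empty."""
        if not self.queue:
            return None
        if self.serve_left_next:
            cpu_id = self.queue.popleft()
        else:
            cpu_id = self.queue.pop()
        # Alternate ends on every grant.
        self.serve_left_next = not self.serve_left_next
        return cpu_id


def drain(arbiter):
    """Grant requests until the queue is empty and return the service order."""
    order = []
    while (cpu_id := arbiter.grant()) is not None:
        order.append(cpu_id)
    return order
```

Under these assumptions, queuing requests from processors 1, 9, 2 and 10 in that order and then calling `drain` yields the service order 2, 10, 1, 9: grants alternate between the two sides, and the request at each outer end of the queue is served first. The model needs no bookkeeping beyond a single toggle bit, which is in keeping with the simplicity the text attributes to the scheme.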
==Instruction pipeline==
In order to allow the system to work even with the high inter-unit latencies, each processor used an 8-deep instruction pipeline. Branches used a variable delay slot, the end of which was signaled by a bit in the next instruction. The bit indicated that the results of the branch had to be re-merged at this point, stalling the processor until this took place. Each processor also included a floating-point unit from Weitek.

For marketing purposes, each processor was called a "computational unit" (CU), and a card-cage populated with 16 was referred to as a "processor". This allowed favorable per-processor performance comparisons with other supercomputers of the era.

The processors ran at 20 MHz in the integer units and 40 MHz for the FPUs, with the intention of raising this to 50 MHz by the time the machine shipped. At about 12 Mflops peak per CU, the machine as a whole would deliver up to 1.5 Gflops, although due to the memory latencies the figure was typically closer to 250 Mflops. While this was fast for a CMOS machine of the time, it was hardly competitive for a supercomputer. Nevertheless, the machine was air-cooled, and would have been the fastest such machine on the market.

The machine ran an early version of the Mach kernel for multi-processor support. The compilers were designed to keep the processors as full as possible by reducing the number of branch delay slots, and did a particularly good job of it.

==Fatal flaw==