University of Texas at Austin's implementation of the EDGE concept is the
TRIPS processor, the
Tera-op, Reliable, Intelligently adaptive Processing System. A TRIPS CPU is built by repeating a single basic functional unit as many times as needed. The TRIPS design's use of hyper-blocks that are loaded en-masse allows for dramatic gains in
speculative execution. Whereas a traditional design might have a few hundred instructions to examine for possible scheduling into the functional units, the TRIPS design has thousands, hundreds of instructions per hyper-block, and hundreds of hyper-blocks being examined. This leads to greatly improved functional unit utilization; scaling its performance to a typical four-issue superscalar design, TRIPS can process about three times as many instructions per cycle. Whilst traditional designs uses varying units, allowing more parallelism than the four-wide schedulers would otherwise allow, TRIPS, in order to keep all of the units active includes all types of instructions in the instruction stream. As this is often not the case in practice, traditional CPUs often have many idle functional units. In TRIPS, the individual units are general purpose, allowing any instruction to run on any core. Not only does this avoid the need to carefully balance the number of different kinds of cores, but it also means that a TRIPS design can be built with any number of cores needed to reach a particular performance requirement. A single-core TRIPS CPU, with a simplified (or eliminated) scheduler will run a set of hyperblocks exactly like one with hundreds of cores, only slower. The performance is also not dependent on the types of data being fed in, meaning that a TRIPS CPU will run a much wider variety of tasks at the same level performance. For instance, if a traditional CPU is fed a math-heavy workload, it will bog as soon as all the floating point units are busy, with the integer units lying idle. If it is fed a data intensive program like a database job, the floating point units will lie idle while the integer units bog. In a TRIPS CPU every functional unit will add to the performance of every task, because every task can run on every unit. The designers refer to as a "polymorphic processor". Like TRIPS, DSPs gain additional performance by limiting data inter-dependencies, but unlike TRIPS they do so by allowing only a very limited workflow to run on them. TRIPS would be just as fast as a custom DSP on these workloads, but equally able to run other workloads at the same time. Though a TRIPS processor likely couldn't be used to replace highly customized designs like
GPUs in modern
graphics cards, they may be able to replace or outperform many lower-performance chips like those used for media processing. The reduction of the global register file also results in non-obvious gains. The addition of new circuitry to modern processors has meant that their overall size has remained about the same even as they move to smaller process sizes. As a result, the relative distance to the register file has grown, and this limits the possible cycle speed due to communications delays. In EDGE the data is generally more local or isolated in well defined inter-core links, eliminating large "cross-chip" delays. This means the individual cores can be run at higher speeds, limited by the signalling time of the much shorter data paths. The combination of these two design changes effects greatly improves system performance. However, as of 2008, GPUs from
ATI and
NVIDIA have already exceeded the 1 teraflop barrier (albeit for specialized applications). As for traditional CPUs, a contemporary (2007)
Mac Pro using a 2-core Intel
Xeon can only perform about 5 GFLOPs on single applications. In 2003, the TRIPS team started implementing a prototype chip. Each chip has two complete cores, each one with 16 functional units in a four-wide, four-deep arrangement. In the current implementation, the compiler constructs "hyperblocks" of 128 instructions each, and allows the system to keep eight blocks "in flight" at the same time, for a total of 1,024 instructions per core. The basic design can include up to 32 chips interconnected, approaching 500 GFLOPS. == References ==