The R5000 is a two-way
superscalar design that executes instructions
in-order. The R5000 could simultaneously issue an integer and a floating-point instruction. It had one simple
pipeline for integer instructions and another for floating-point to save transistors and die area to reduce cost. The R5000 did not perform
dynamic branch prediction for cost reasons. Instead it uses a static approach, utilizing the hints encoded by the
compiler in the branch-likely instructions first introduced in the MIPS II architecture to determine how likely a branch is taken. The R5000 had large L1
caches, a distinct characteristic of QED, whose designers favored simple designs with large caches. The R5000 had two L1 caches, one for instructions and the other for data. Both have a capacity of 32 KB. The caches are two-way
set-associative, have a 32-byte line size, and are
virtually indexed, physically tagged. Instructions were predecoded as they enter the instruction cache by appending four bits to each instruction. These four bits specify whether they can be issued together and which execution unit they are executed by. This assisted superscalar instruction issue by moving some of the dependency and conflict checking out of the critical path. The integer unit executes most instructions with a one cycle latency and throughput except for multiply and divide. 32-bit multiplies have a five-cycle latency and a four-cycle throughput. 64-bit multiplies have an extra four cycles of latency and half the throughput. Divides have a 36-cycle latency and throughput for 32-bit integers, and for 64-bit integers, they are increased to 68 cycles. The
floating-point unit (FPU) was a fast single-precision (32-bit) design, for reduced cost and to benefit SGI, whose mid-range 3D graphics workstations relied mostly on single-precision math for 3D graphics applications. It was fully pipelined, which made it significantly better than that of the
R4700. The R5000 implements the multiply-add instruction of the MIPS IV ISA. Single-precision adds, multiplies and multiply-adds have a four-cycle latency and a one cycle throughput. Single-precision divides, reciprocals and square roots have a 21-cycle latency and a 19-cycle throughput, while reciprocal square roots have a 38-cycle latency and a 36-cycle throughput (all of these are not pipelined). Instructions that operate on double precision numbers have a significantly higher latency and lower throughput except for add, which has identical latency and throughput with single-precision add. Multiply and multiply-add have a five-cycle latency and a two-cycle throughput. Divide has a 36-cycle latency and a 34-cycle throughput. Reciprocal square root has a 68-cycle latency and a 66-cycle throughput. The R5000 had an integrated L2 cache controller that supported capacities of 512 KB, 1 MB and 2 MB. The L2 cache shares the SysAD bus with the external interface. The cache was built with custom synchronous SRAMs (SSRAMs). The microprocessor uses the SysAD
bus that is also used by several other MIPS microprocessors. The bus is
multiplexed (address and data share the same set of wires) and can operate at clock frequencies up to 100 MHz. The initial R5000 did not support
multiprocessing, but the package reserved eight pins for the future addition of this feature. QED was a fabless company and did not fabricate their own designs. The R5000 was fabricated by IDT, NEC and NKK. All three companies fabricated the R5000 in a 0.35 μm
complementary metal–oxide–semiconductor (CMOS) process, but with different process features. IDT fabricated the R5000 in a process with two levels of polysilicon and three levels of
aluminium interconnect. The two levels of polysilicon enabled IDT to use a four-transistor SRAM cell, resulting in a transistor count of 3.6 million and a die that measured 8.7 mm by 9.7 mm (84.39 mm2). NEC and NKK fabricated the R5000 in a process with one level of polysilicon and three levels of aluminium interconnect. Without an extra level of polysilicon, both companies had to use a six-transistor SRAM cell, resulting in a transistor count of 5.0 million and a larger die with an area of around 87 mm2. Die sizes in the range of 80 to 90 mm2 were claimed by MTI. 0.8 million of the transistors in both versions were for logic, and the remainder contained in the caches. It was packaged in a 272-ball plastic
ball grid array (BGA) or 223-pin ceramic
pin grid array (PGA). It was not pin-compatible with any previous MIPS microprocessor. ==Derivatives==