Image: blade featuring four SPARC64 VIIIfx processors (under the larger heat exchangers)

The SPARC64 VIIIfx (Venus) is an eight-core processor that is based on the SPARC64 VII and designed for high-performance computing (HPC). As a result, the VIIIfx did not succeed the VII, but existed concurrently with it. It consists of 760 million transistors, measures 22.7 mm by 22.6 mm (513.02 mm²), is fabricated in Fujitsu's 45 nm CMOS process with copper interconnects, and has 1,271 I/O pins. At 2 GHz, the VIIIfx has a peak performance of 128 GFLOPS and a typical power consumption of 58 W at 30 °C, for an efficiency of 2.2 GFLOPS/W. The VIIIfx has four integrated memory controllers for a total of eight memory channels. It connects to 64 GB of DDR3 SDRAM and has a peak memory bandwidth of 64 GB/s.
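
The headline figures above can be cross-checked with a little arithmetic. The following is a minimal sketch that uses only the numbers quoted in this section; the variable names are illustrative, and the derived FLOPs-per-cycle and bytes-per-FLOP values are computed here rather than taken from any Fujitsu document.

```python
# Figures quoted in the text for the SPARC64 VIIIfx.
CORES = 8
CLOCK_HZ = 2e9           # 2 GHz
PEAK_GFLOPS = 128.0      # peak performance at 2 GHz
POWER_W = 58.0           # typical power consumption at 30 °C
DIE_MM = (22.7, 22.6)    # die dimensions in millimetres
MEM_BW_GB_S = 64.0       # peak memory bandwidth

# Derived values (assumption: simple ratios of the quoted figures).
die_area_mm2 = DIE_MM[0] * DIE_MM[1]
efficiency_gflops_per_w = PEAK_GFLOPS / POWER_W
flops_per_core_per_cycle = PEAK_GFLOPS * 1e9 / (CORES * CLOCK_HZ)
bytes_per_flop = MEM_BW_GB_S / PEAK_GFLOPS

print(f"die area:            {die_area_mm2:.2f} mm^2")
print(f"efficiency:          {efficiency_gflops_per_w:.1f} GFLOPS/W")
print(f"FLOPs/core/cycle:    {flops_per_core_per_cycle:.0f}")
print(f"memory bytes per FLOP: {bytes_per_flop:.2f}")
```

The per-core result of eight FLOPs per cycle is what one would expect if each core can retire four fused multiply-add operations per cycle, each counted as two FLOPs.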
===History===

The VIIIfx was developed for the Next-Generation Supercomputer Project (also called Kei Soku Keisenki and Project Keisoku), initiated by Japan's Ministry of Education, Culture, Sports, Science and Technology in January 2006. The project aimed to produce the world's fastest supercomputer, with performance of over 10 PFLOPS, by March 2011. The companies contracted to develop the supercomputer were Fujitsu, Hitachi, and NEC. The supercomputer was originally envisioned to have a hybrid architecture containing scalar and vector processors. The Fujitsu-designed VIIIfx was to have been the scalar processor, and the vector processor was to have been jointly designed by Hitachi and NEC. However, due to the 2008 financial crisis, Hitachi and NEC announced in May 2009 that they would leave the project because manufacturing the hardware they were responsible for would result in financial losses for them. Afterwards, Fujitsu redesigned the supercomputer to use the VIIIfx as its only processor type.

By 2010, the supercomputer that would be built by the project was named the K computer. Located at RIKEN's Advanced Institute for Computational Science (AICS) in Kobe, Japan, it obtains its performance from 88,128 VIIIfx processors. In June 2011, the TOP500 Project Committee announced that the K computer (still incomplete, with only 68,544 processors) topped the LINPACK benchmark at 8.162 PFLOPS, realizing 93% of its peak performance and making it the fastest supercomputer in the world at that time.
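
The 93% figure can be reproduced from the processor counts above and the per-chip peak of 128 GFLOPS quoted earlier in this article. This is a rough sketch under the assumption that system peak is simply the processor count multiplied by the per-chip peak; the function name `system_peak_pflops` is illustrative.

```python
PEAK_GFLOPS_PER_CHIP = 128.0   # per-processor peak at 2 GHz, as quoted in the article


def system_peak_pflops(n_processors: int) -> float:
    """Theoretical peak in PFLOPS, assuming peak scales linearly with chip count."""
    return n_processors * PEAK_GFLOPS_PER_CHIP / 1e6   # 1 PFLOPS = 1e6 GFLOPS


partial_peak = system_peak_pflops(68_544)   # configuration measured in June 2011
full_peak = system_peak_pflops(88_128)      # complete system
linpack_pflops = 8.162                      # reported LINPACK result
linpack_efficiency = linpack_pflops / partial_peak

print(f"partial system peak: {partial_peak:.3f} PFLOPS")
print(f"full system peak:    {full_peak:.3f} PFLOPS")
print(f"LINPACK efficiency:  {linpack_efficiency:.1%}")
```

Under the same assumption, the complete 88,128-processor system has a theoretical peak a little above 11 PFLOPS, which is consistent with the project's goal of exceeding 10 PFLOPS.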
===Description===

The VIIIfx core is based on that of the SPARC64 VII with numerous modifications for HPC, namely High Performance Computing-Arithmetic Computational Extensions (HPC-ACE), a Fujitsu-designed extension to the SPARC V9 architecture.

In the front-end, coarse-grained multi-threading was removed; the L1 instruction cache was halved in size to 32 KB; the number of branch target address cache (BTAC) entries was reduced to 1,024 from 8,192, and its associativity to two from eight; and an extra pipeline stage was inserted before the instruction decoder. This stage accommodated the greater number of integer and floating-point registers defined by HPC-ACE. The SPARC V9 architecture was designed to have only 32 integer and 32 floating-point registers, and the SPARC V9 instruction encoding limits the number of registers that can be specified to 32. To specify the extra registers, HPC-ACE has a "prefix" instruction that immediately precedes the one or two SPARC V9 instructions it modifies. The prefix instruction contains (primarily) the portions of the register numbers that cannot fit within a SPARC V9 instruction. The extra pipeline stage combined up to four SPARC V9 instructions with up to two prefix instructions; the combined instructions were then decoded in the next pipeline stage.

The back-end was also heavily modified. The number of reservation station entries for branch and integer instructions was reduced to six and ten, respectively. Both the integer and floating-point register files had registers added to them: the integer register file gained 32, and there were a total of 256 floating-point registers. The extra integer registers are not part of the register windows defined by SPARC V9, but are always accessible via the prefix instruction. The 256 floating-point registers could be used by scalar floating-point instructions and by both integer and floating-point SIMD instructions. An extra pipeline stage was added to the beginning of the floating-point execution pipeline to access the larger floating-point register file. The 128-bit SIMD instructions from HPC-ACE were implemented by adding two extra floating-point units, for a total of four. SIMD execution can perform up to four single- or double-precision fused multiply-add operations (eight FLOPs) per cycle. The number of load queue entries was increased to 20 from 16, and the L1 data cache was halved in size to 32 KB. The number of commit stack entries, which determined the number of instructions that could be in flight in the back-end, was reduced to 48 from 64.
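
To illustrate why a prefix is needed: a 5-bit register field can name 32 registers, whereas 256 floating-point registers need 8 bits, so three more bits per operand have to come from somewhere else. The sketch below models only that idea. The function name `extend_register_number`, the field widths as constants, and the choice to place the prefix bits above the base field are assumptions for illustration, not the actual HPC-ACE encoding, and the sketch ignores SPARC V9's own rules for numbering double-precision registers.

```python
BASE_FIELD_BITS = 5     # width of a register field that can name 32 registers
FP_REGISTERS = 256      # floating-point registers available with HPC-ACE

# Bits needed to index every register, and how many do not fit in the base field.
TOTAL_BITS = (FP_REGISTERS - 1).bit_length()
EXTRA_BITS = TOTAL_BITS - BASE_FIELD_BITS


def extend_register_number(base_field: int, prefix_bits: int = 0) -> int:
    """Combine a base register field with hypothetical upper bits from a prefix.

    With no prefix (prefix_bits == 0) only registers 0..31 are reachable,
    which mirrors the limit of the unextended encoding.
    """
    if not 0 <= base_field < (1 << BASE_FIELD_BITS):
        raise ValueError("base_field does not fit in the base register field")
    if not 0 <= prefix_bits < (1 << EXTRA_BITS):
        raise ValueError("prefix_bits does not fit in the extra bits")
    # Hypothetical layout: prefix bits form the high part of the register number.
    return (prefix_bits << BASE_FIELD_BITS) | base_field


print(TOTAL_BITS, EXTRA_BITS)
print(extend_register_number(31))       # highest register reachable without a prefix
print(extend_register_number(31, 7))    # highest register reachable with a prefix
```

In this model an instruction with three register operands would need nine extra bits in total, which gives a sense of why the extra bits are carried in a separate instruction rather than squeezed into the existing 32-bit format.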
===Miscellaneous specifications===

• Physical address range: 41 bits
• Cache:
:* L1: 32 KB two-way set-associative data, 32 KB two-way set-associative instruction (128-byte cache line), sectored
:* L2: 6 MB 12-way set-associative (128-byte line), index-hashed, sectored
• Translation lookaside buffer (TLB):
:* A 16-entry micro-TLB and a 256-entry, four-way set-associative TLB for instructions
:* A 512-entry, four-way set-associative TLB for data, no victim cache
• Page sizes: 8 KB, 64 KB, 512 KB, 4 MB, 32 MB, 256 MB, 2 GB

==SPARC64 IXfx==