A video presentation of the history and technology of the Blue Gene project was given at the Supercomputing 2020 conference. In December 1999, IBM announced a US$100 million research initiative for a five-year effort to build a massively
parallel computer, to be applied to the study of biomolecular phenomena such as
protein folding. The research and development was pursued by a large multi-disciplinary team at the
IBM T. J. Watson Research Center, initially led by
William R. Pulleyblank. The project had two main goals: to advance understanding of the mechanisms behind protein folding via large-scale simulation, and to explore novel ideas in massively parallel machine architecture and software. Major areas of investigation included: how to use this novel platform to effectively meet its scientific goals, how to make such massively parallel machines more usable, and how to achieve performance targets at a reasonable cost, through novel machine architectures. The initial design for Blue Gene was based on an early version of the
Cyclops64 architecture, designed by
Monty Denneau. In parallel, Alan Gara had started working on an extension of the
QCDOC architecture into a more general-purpose supercomputer. The
US Department of Energy started funding the development of this system and it became known as Blue Gene/L (L for Light). Development of the original Blue Gene architecture continued under the name Blue Gene/C (C for Cyclops) and, later, Cyclops64. Architecture and chip logic design for the Blue Gene systems was done at the
IBM T. J. Watson Research Center, chip design was completed and chips were manufactured by
IBM Microelectronics, and the systems were built at
IBM Rochester, MN. Alan Gara was the Chief Architect and
Paul Coteus was the Chief Engineer. In November 2004 a 16-
rack system, with each rack holding 1,024 compute nodes, achieved first place in the
TOP500 list, with a
LINPACK benchmark performance of 70.72 TFLOPS. The installation at Lawrence Livermore National Laboratory (LLNL) was gradually expanded to 104 racks, achieving 478 TFLOPS Linpack and 596 TFLOPS peak. The LLNL BlueGene/L installation held the first position in the TOP500 list for 3.5 years, until June 2008, when it was overtaken by IBM's Cell-based
Roadrunner system at
Los Alamos National Laboratory, which was the first system to surpass the 1 PetaFLOPS mark. While the LLNL installation was the largest Blue Gene/L installation, many smaller installations followed. The November 2006
TOP500 list showed 27 computers with the
eServer Blue Gene Solution architecture. For example, three racks of Blue Gene/L were housed at the
San Diego Supercomputer Center. While the
TOP500 measures performance on a single benchmark application, Linpack, Blue Gene/L also set records for performance on a wider set of applications. Blue Gene/L was the first supercomputer ever to run over 100
TFLOPS sustained on a real-world application, namely a three-dimensional molecular dynamics code (ddcMD), simulating solidification (nucleation and growth processes) of molten metal under high pressure and temperature conditions. This achievement won the 2005
Gordon Bell Prize. In June 2006,
NNSA and IBM announced that Blue Gene/L achieved 207.3 TFLOPS on a quantum chemical application (
Qbox). At Supercomputing 2006, Blue Gene/L won the prize in all classes of the HPC Challenge awards. In 2007, a team from the
IBM Almaden Research Center and the
University of Nevada ran an
artificial neural network almost half as complex as the brain of a mouse for the equivalent of a second (the network was run at 1/10 of normal speed for 10 seconds).

==The name==
The name Blue Gene comes from what it was originally designed to do: help biologists understand the processes of
protein folding and
gene development. "Blue" is a traditional moniker that IBM uses for many of its products and
the company itself. The original Blue Gene design was renamed "Blue Gene/C" and eventually
Cyclops64. The "L" in Blue Gene/L comes from "Light" as that design's original name was "Blue Light". The "P" version was designed to be a
petascale system. "Q" is just the letter after "P".

==Major features==
The Blue Gene/L supercomputer was unique in the following aspects:
• Trading the speed of processors for lower power consumption. Blue Gene/L used low-frequency, low-power embedded PowerPC cores with floating-point accelerators. While the performance of each chip was relatively low, the system could achieve better power efficiency for applications that could use large numbers of nodes.
• Dual processors per node with two working modes: co-processor mode, where one processor handles computation and the other handles communication, and virtual-node mode, where both processors are available to run user code and share both the computation and the communication load (see the sketch after this list).
• System-on-a-chip design. Components were embedded on a single chip for each node, with the exception of 512 MB of external DRAM.
• A large number of nodes (scalable in increments of 1,024 up to at least 65,536).
• Three-dimensional torus interconnect with auxiliary networks for global communications (broadcast and reductions), I/O, and management.
• Lightweight OS per node for minimum system overhead (system noise).
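Co-processor mode targeted applications that overlap computation with communication: one CPU computes while the other drives the network. As a rough, hedged illustration of that pattern from the application's side, the sketch below uses only standard MPI non-blocking calls; nothing in it is Blue Gene-specific, and the ring exchange and buffer size are invented for the example.

<syntaxhighlight lang="c">
/* Sketch: overlapping computation with communication, the access pattern
 * that co-processor mode was designed to accelerate.  Standard MPI only;
 * the message size and ring exchange are illustrative, not from BG/L code. */
#include <mpi.h>
#include <stdio.h>

#define N 1024  /* illustrative message size */

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double send[N], recv[N], interior = 0.0;
    for (int i = 0; i < N; i++) send[i] = rank + i;

    int right = (rank + 1) % size;          /* ring neighbours */
    int left  = (rank - 1 + size) % size;

    /* Post the exchange without blocking ... */
    MPI_Request reqs[2];
    MPI_Irecv(recv, N, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(send, N, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &reqs[1]);

    /* ... and keep computing on local data while messages are in flight. */
    for (int i = 0; i < N; i++) interior += send[i] * 1e-3;

    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);

    if (rank == 0) printf("interior=%f first halo value=%f\n", interior, recv[0]);
    MPI_Finalize();
    return 0;
}
</syntaxhighlight>

In virtual-node mode the same program would simply run with two MPI ranks per node, each rank handling its own share of both the computation and the communication.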

==Architecture==
The Blue Gene/L architecture was an evolution of the QCDSP and
QCDOC architectures. Each Blue Gene/L Compute or I/O node was a single
ASIC with associated
DRAM memory chips. The ASIC integrated two 700 MHz
PowerPC 440 embedded processors, each with a double-pipeline, double-precision
Floating-Point Unit (FPU), a
cache sub-system with a built-in DRAM controller, and the logic to support multiple communication sub-systems. The dual FPUs gave each Blue Gene/L node a theoretical peak performance of 5.6
GFLOPS (gigaFLOPS). The two CPUs were not
cache coherent with one another. Compute nodes were packaged two per compute card, with 16 compute cards (thus 32 nodes) plus up to 2 I/O nodes per node board. A cabinet/rack contained 32 node boards. Through the integration of all essential sub-systems on a single chip and the use of low-power logic, each Compute or I/O node dissipated only about 17 watts (including DRAMs). The low power per node allowed aggressive packaging of up to 1,024 compute nodes, plus additional I/O nodes, in a standard
19-inch rack, within reasonable limits on electrical power supply and air cooling. The system performance metrics, in terms of
FLOPS per watt, FLOPS per m² of floor space, and FLOPS per unit cost, allowed scaling up to very high performance. With so many nodes, component failures were inevitable. The system was able to electrically isolate faulty components, down to a granularity of half a rack (512 compute nodes), to allow the machine to continue to run.
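As a rough consistency check on these figures (assuming each of the two FPU pipelines performs one fused multiply-add, i.e. two floating-point operations, per cycle): 700 MHz × 2 pipelines × 2 operations ≈ 2.8 GFLOPS per processor, or 5.6 GFLOPS per dual-processor node; 1,024 nodes per rack then gives about 5.7 TFLOPS peak per rack, and the 104-rack LLNL system about 596 TFLOPS peak, matching the figure quoted above.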
Each Blue Gene/L node was attached to three parallel communications networks: a 3D toroidal network for peer-to-peer communication between compute nodes, a collective network for collective communication (broadcasts and reduce operations), and a global interrupt network for fast barriers. The I/O nodes, which ran the Linux operating system, provided communication to storage and external hosts via an Ethernet network. The I/O nodes handled filesystem operations on behalf of the compute nodes. A separate, private Ethernet management network provided access to any node for configuration, booting, and diagnostics.
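Applications typically reached these networks through MPI. The hedged sketch below shows how a code might lay ranks out on the 3D torus and combine point-to-point traffic with a global reduction; it uses only standard MPI calls, and the 8×8×16 torus dimensions are invented for the example (they require exactly 1,024 ranks).

<syntaxhighlight lang="c">
/* Sketch: a 3D Cartesian (torus) communicator plus a global reduction.
 * Standard MPI; on Blue Gene/L the MPI library could map the Cartesian
 * layout onto the physical torus and route reductions over the collective
 * network.  Dimensions below are illustrative only. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int dims[3] = {8, 8, 16};      /* hypothetical 8 x 8 x 16 torus */
    int periods[3] = {1, 1, 1};    /* periodic in all three dimensions */
    MPI_Comm torus;
    MPI_Cart_create(MPI_COMM_WORLD, 3, dims, periods, 1, &torus);

    /* Nearest neighbours along the x dimension: point-to-point traffic
     * that travels over the torus network. */
    int xminus, xplus;
    MPI_Cart_shift(torus, 0, 1, &xminus, &xplus);

    double local = (double)rank, halo = 0.0, total = 0.0;
    MPI_Sendrecv(&local, 1, MPI_DOUBLE, xplus, 0,
                 &halo,  1, MPI_DOUBLE, xminus, 0,
                 torus, MPI_STATUS_IGNORE);

    /* A global sum: the kind of operation the dedicated collective
     * network was built to accelerate. */
    MPI_Allreduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, torus);

    if (rank == 0) printf("halo=%f total=%f\n", halo, total);

    MPI_Comm_free(&torus);
    MPI_Finalize();
    return 0;
}
</syntaxhighlight>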
To allow multiple programs to run concurrently, a Blue Gene/L system could be partitioned into electronically isolated sets of nodes. The number of nodes in a partition had to be a positive integer power of 2, with at least 2⁵ = 32 nodes. To run a program on Blue Gene/L, a partition of the computer first had to be reserved. The program was then loaded and run on all the nodes within the partition, and no other program could access nodes within the partition while it was in use. Upon completion, the partition nodes were released for future programs to use. Blue Gene/L compute nodes used a minimal
operating system supporting a single user program. Only a subset of POSIX calls was supported, and only one process could run at a time on a node in co-processor mode, or one process per CPU in virtual-node mode. Programmers needed to implement green threads in order to simulate local concurrency.
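Since the compute-node kernel offered neither multiple processes nor POSIX threads, concurrency within a node had to be cooperative. The sketch below is a deliberately simplified, hypothetical illustration of that idea, a round-robin cooperative scheduler in plain C; it is not the actual Blue Gene/L runtime or any specific green-thread library.

<syntaxhighlight lang="c">
/* Sketch: simulating concurrency inside a single process without OS
 * threads, in the spirit of the green threads mentioned above.  A plain
 * round-robin cooperative scheduler; purely illustrative. */
#include <stdio.h>

#define NTASKS 3
#define NSTEPS 4

struct task {
    int id;
    int step;   /* how far this task has progressed */
    int done;
};

/* One "slice" of work; a real green-thread package would instead save
 * and restore execution contexts. */
static void run_slice(struct task *t) {
    printf("task %d: step %d\n", t->id, t->step);
    if (++t->step >= NSTEPS)
        t->done = 1;
}

int main(void) {
    struct task tasks[NTASKS];
    for (int i = 0; i < NTASKS; i++)
        tasks[i] = (struct task){ .id = i, .step = 0, .done = 0 };

    /* Round-robin over the tasks until every one has finished; control
     * is handed over voluntarily, never preempted. */
    int remaining = NTASKS;
    while (remaining > 0) {
        for (int i = 0; i < NTASKS; i++) {
            if (!tasks[i].done) {
                run_slice(&tasks[i]);
                if (tasks[i].done)
                    remaining--;
            }
        }
    }
    return 0;
}
</syntaxhighlight>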
Application development was usually performed in C, C++, or Fortran using MPI for communication. However, some scripting languages such as Ruby and Python have been ported to the compute nodes. IBM published BlueMatter, the application developed to exercise Blue Gene/L, as open source. It documents how the torus and collective interfaces were used by applications, and may serve as a basis for others to exercise the current generation of supercomputers.

==Blue Gene/P==