Gbcast

Introduced in 1985, Gbcast was the first widely deployed reliable multicast protocol to implement state machine replication with dynamically reconfigurable membership. Although this problem had been treated theoretically under various models in prior work, Gbcast innovated by showing that the same multicasts used to update replicated data within the state machine can also be used to dynamically reconfigure the group membership, which can then evolve to permit members to join and leave at will, in addition to being removed upon failure. This functionality, together with a state transfer mechanism used to initialize joining members, represents the basis of the virtual synchrony process group execution model. The term state machine replication was first suggested by Leslie Lamport and was widely adopted after publication of a survey paper written by Fred B. Schneider. The model covers any system in which some deterministic object (a state machine) is replicated in such a way that a series of commands can be applied to the replicas fault-tolerantly. A reconfigurable state machine is one that can vary its membership, adding new members or removing old ones. Some state machine protocols can also ride out the temporary unavailability of a subset of the current members without requiring reconfiguration when such situations arise, including Gbcast and also Paxos, Lamport's widely cited protocol for state machine replication. State machine replication is closely related to the distributed Consensus problem, in which a collection of processes must agree upon some decision outcome, such as the winner of an election. In particular, it can be shown that any solution to the state machine replication problem would also be capable of solving distributed consensus. As a consequence, impossibility results for distributed consensus apply to solutions to the state machine replication problem. Implications of this finding are discussed under liveness. Gbcast is somewhat unusual in that most solutions to the state machine replication problem are closely integrated with the application being replicated. Gbcast, in contrast, is designed as a multicast API and implemented by a library that delivers messages to group members. Lamport, Malkhi and Zhou note that few reliable multicast protocols have the durability properties required to correctly implement the state machine model. Gbcast does exhibit the necessary properties. The Gbcast protocol was first described in a 1985 publication that discussed infrastructure supporting the virtual synchrony model in the Isis Toolkit. Additional details were provided in a later 1987 journal article, and an open-source version of the protocol was released by the Cornell developers in November of that year. Isis used the protocol primarily for maintaining the membership of process groups but also offered an API that could be called directly by end-users. The technology became widely used starting in 1988, when the Isis system was commercialized and support became available. Commercial support for the system ended in 1998 when Stratus Computer, then the parent of Isis Distributed Systems, refocused purely on hardware solutions for the telecommunications industry. Examples of systems that used Isis in production settings include the New York Stock Exchange, where it was employed for approximately a decade to manage a configurable, fault-tolerant and self-healing reporting infrastructure for the trading floor, to relay quotes and trade reports from the "back office" systems used by the exchange to overhead display. The French Air Traffic Control System continues to use Isis; since 1996 the system has been employed to create fault-tolerant workstation clusters for use by air traffic controllers and to reliably relay routing updates between air traffic control centers; over time the French technology has also been adopted by other European ATC systems. The US Navy AEGIS has used Isis since 1993 to support a reliable and self-healing communication infrastructure. Isis also had several hundred other production users in the financial, telecommunications, process control, SCADA and other critical infrastructure domains. More details can be found in. == Problem statement ==