The fundamental problem solved by Gbcast is this: we are given an initial set of
group members and wish to support a multicast abstraction, permitting members of the group to send messages that encode various commands or requests. The protocol must agree on the messages to deliver, and on their ordering, so that if any member of the group sends a message, every member of the group that doesn't fail will receive that message and in the same order with respect to other delivered messages. The set of group members changes each time a member fails or joins, and Gbcast is also used to maintain group membership by means of special multicasts that are delivered to the application as "new view" events, but that also adjust the group membership list maintained by the Gbcast protocol library. The application thus sees a series of membership views that start with an "initial view" when a particular group member joins, and then evolve over time, and that are ordered with respect to other view-changing events and multicast messages. These multicasts are delivered to all the non-failed members listed in the view during which delivery is scheduled, a property referred to as virtual synchrony. Network partitions can split a group into two or more disjoint subgroups, creating the risk of
split brain behavior, in which some group members take a decision (perhaps, to launch the rocket) without knowing that some other partition of the group has taken a different, conflicting decision. Gbcast offers protection against this threat: the protocol ensures that progress occurs only in a single
primary partition of the group. Thus, should a
network partition arise, at most one subgroup of members will continue operations, while the other is certain to stall and shut down. Should a failed member recover (or if a partitioning failure caused some member to be incorrectly sensed as faulty and hence dropped from the view), after communication is restored, that member can rejoin. An
incarnation number is used to avoid ambiguity: a counter that will be incremented each time a process joins the group, and is treated as part of the
process identifier. Any given (processor-id, process-id, incarnation-number) tuple joins the group at most once, then remains in the group until it fails, or is forced to leave because a time out occurred. Any dynamically reconfigurable system, including both Gbcast and Paxos, can enter states from which no further progress is possible. For example, this could happen if operational processes are wrongly removed from the configuration, and then too many real crashes occur within the remaining members of the view. In such situations, the data center management infrastructure is responsible for restarting the entire application. This is in contrast to the behavior of non-reconfigurable (
vanilla) Paxos, which can tolerate disruptions of unlimited duration and then will resume once enough group members are accessible, without intervention of the management infrastructure. The following terms are used in the detailed protocol description.
Processes • Processes run on processors that operate at arbitrary speed. • Processes may experience crash (halting) failures. • A process is uniquely identified by a three-tuple: (processor-id, process-id, incarnation-number). • Processes with
stable storage may re-join the protocol after failures (following a crash-recovery failure model), after incrementing the incarnation number. • Processes do not collude, lie, or otherwise attempt to subvert the protocol. (That is, Byzantine failures don't occur.)
Network • All processes in the system can send messages to all other processes in the system. • Messages are sent asynchronously: there is no time bound on message delivery. • Messages may be lost, reordered, or duplicated. • Messages are delivered without corruption. These are weak assumptions: a network that never delivers any messages would satisfy them (we would say that such a network is experiencing a complete and permanent
partitioning failure). The network conditions required for Gbcast to guarantee progress are discussed below. In practice Gbcast is normally used within data centers; these have networks that can experience transient failures, but in which partitioning failures are rare, and generally impact just small subsets of the nodes. Thus for purposes of analysis we assume a harsher networking environment than would arise in actual deployments. To simplify the presentation, we assume that a
TCP-like acknowledgement / retransmission scheme is employed, creating the illusion of a reliable, sequenced, non-repeating message channel between each pair of processes. A
timeout occurs if this channel abstraction retries repeatedly and is unable to obtain an acknowledgement for some message. Using the same TCP-like channels, we can also support a 1-to-all capability, whereby a single process sends some message over its channels to all the other members of some view of some group. This is done by mapping the 1-to-all request into multiple 1-to-1 messages. Notice that these 1-to-all channels lack any atomicity guarantee: if the sender fails while a message is being sent, it might reach just some of the destinations.
Process Groups and Views :Gbcast is defined with respect to a "process group:" a set of processes. In a deployed system such a group might have a name (like a file name), a way to initially contact the group, and other attributes such as flow-control parameters. However, those kinds of details are omitted here for brevity. :The term
membership view is a list of members, rank-ordered by age (determined by the view in which each member most recently joined the group) and with ties broken by a lexicographic ordering rule. :The initial membership of the group is specified by an external agent and defines the first membership view of the group. :Subsequent membership views arise by applying
add and
remove commands and are identified by sequence number. :New views are reported to the processes belonging to the view by means of "new view" events. The application is notified via an
upcall (a call from the library into a handler defined by the application program).
Multicast Messages :Members of a view can request that multicast messages be sent to a process group without knowledge of the membership that will apply at the time of delivery. :The Gbcast protocol carries out these operations with a series of guarantees, discussed below. :Delivery is by upcall to the application, which can perform whatever action the message requests.
Roles Gbcast is best understood in terms of a set of roles.
Application :An application corresponds to a program which can be launched on one or more processors. Each application process then joins one or more process groups. :An application process belonging to a group initiates new multicasts by invoking Gbcast. The protocol is considered to have terminated when all members of the target group have either acknowledged delivery of the message, or have been detected as faulty, via a mechanism explained below. :Incoming Gbcast messages are delivered via upcalls, as are view change notifications. :As noted earlier, the members of a group observe the same sequence of upcalls starting when they initially join: an initial view and then a sequence of new views and multicast messages. All members of a group receive any particular multicast in the same view, and the multicast is delivered to all non-failed members of that view.
Leader :The leader of a group is defined with respect to some view of the group, and is the member with lowest rank in the view. As noted, the rank is age-ordered (with older members having lower rank), and ties are broken using a lexicographic sort.
Failure detection :All components of the system are permitted to participate in the role of "detecting" failures. Detection is distinct from the
reporting of the failure (which occurs through a new view and is ordered with respect to message deliveries). :The channel abstraction supported by the network layer senses failures by timeouts. (Notice that under the network model, a process that attempts to send a message to a crashed target process will always experience a timeout, but it is also possible that the channel abstraction could misreport an operational process as faulty if messages are delayed because of a transient partitioning failure). :Any process that experiences a timeout can declare that the endpoint of the associated channel has failed. :If a process learns of a failure for some (processor-id, process-id, incarnation-number) tuple, it includes that information on the next outgoing message on all channels. :A process that considers some other process to have failed will reject messages from the failed incarnation, responding "you have failed". (That is, processes gossip about failures, and shun failed group members). :An incoming message from a new incarnation of a failed process is treated as a message from a "new" process.
Failed process :Any member of the current view that has been detected as failed is considered to be a
failed process. :An operational process that learns that it is considered to have failed (by attempting to communicate with some other process that rejects the message, thereby "shunning" it) might exit from the system, or can increase its incarnation number and rejoin.
New Leader :If every lower-ranked process in the current view is a failed process, then the next highest-ranked non-failed process is designated as the new leader. :The new leader must run a protocol, discussed below, to become the leader.
Quorums Quorums are used to guarantee the safety properties of Gbcast by ensuring that there is a single globally agreed-upon sequence of group views and multicast messages and by preventing progress in more than one partition if a group becomes fragmented into two or more partitions (disjoint subsets of members that can communicate with other members of their subsets, but not with members of other subsets). Quorums are defined for a specific view. Given view
i with
n members {A,B,C….}, a quorum of the view is any majority subset of the members of that view. Notice that this is in contrast to the way the term is defined in systems that have a static underlying membership: for Gbcast, the quorum size will change over time as the membership of a group changes and new views become defined. ==Safety and liveness properties==