BGP neighbors, called peers, are established by manual configuration among routers to create a TCP session on port 179. A BGP speaker sends 19-byte keep-alive messages every 30 seconds (protocol default value, tunable) to maintain the connection. Among routing protocols, BGP is unique in using TCP as its transport protocol.

When BGP runs between two peers in the same autonomous system (AS), it is referred to as Internal BGP (iBGP or Interior Border Gateway Protocol). When it runs between different autonomous systems, it is called External BGP (eBGP or Exterior Border Gateway Protocol). Routers on the boundary of one AS exchanging information with another AS are called border or edge routers or simply eBGP peers and are typically connected directly, while iBGP peers can be interconnected through other intermediate routers. Other deployment topologies are also possible, such as running eBGP peering inside a VPN tunnel, allowing two remote sites to exchange routing information in a secure and isolated manner.

The main difference between iBGP and eBGP peering is in the way routes that were received from one peer are typically propagated by default to other peers:
• New routes learned from an eBGP peer are re-advertised to all iBGP and eBGP peers.
• New routes learned from an iBGP peer are re-advertised to all eBGP peers only.
These route-propagation rules effectively require that all iBGP peers inside an AS are interconnected in a full mesh with iBGP sessions.

How routes are propagated can be controlled in detail via the route-maps mechanism. This mechanism consists of a set of rules. Each rule describes, for routes matching some given criteria, what action should be taken. The action could be to drop the route, or it could be to modify some attributes of the route before inserting it in the routing table.
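The default propagation rules and the rule-based route-map mechanism described above can be sketched in a few lines of Python. This is a minimal illustration, not any vendor's implementation: the names `Peer`, `Route`, `Rule`, `readvertise_targets` and `apply_route_map` are hypothetical, and the choice to let a route that matches no rule pass unchanged is an assumption (real implementations differ on this point).

```python
from dataclasses import dataclass, replace
from typing import Callable, Optional


@dataclass(frozen=True)
class Peer:
    name: str
    external: bool  # True for an eBGP peer, False for an iBGP peer


@dataclass(frozen=True)
class Route:
    prefix: str
    local_pref: int = 100
    communities: frozenset = frozenset()


def readvertise_targets(learned_from, peers):
    """Peers that receive a route learned from `learned_from` by default."""
    targets = []
    for peer in peers:
        if peer == learned_from:
            continue
        # Routes learned over iBGP are passed on to eBGP peers only.
        if not learned_from.external and not peer.external:
            continue
        targets.append(peer)
    return targets


@dataclass(frozen=True)
class Rule:
    match: Callable[[Route], bool]
    action: Optional[Callable[[Route], Route]]  # None means "drop the route"


def apply_route_map(rules, route):
    """Apply the first matching rule; return None if the route is dropped."""
    for rule in rules:
        if rule.match(route):
            return None if rule.action is None else rule.action(route)
    return route  # assumption: unmatched routes pass through unchanged


peers = [Peer("ebgp-a", True), Peer("ebgp-b", True),
         Peer("ibgp-c", False), Peer("ibgp-d", False)]
from_ebgp = [p.name for p in readvertise_targets(peers[0], peers)]
from_ibgp = [p.name for p in readvertise_targets(peers[2], peers)]

rules = [
    # Drop anything inside 10.0.0.0/8 (a crude string match, for brevity).
    Rule(lambda r: r.prefix.startswith("10."), None),
    # Raise local preference for routes tagged with a given community.
    Rule(lambda r: "3491:100" in r.communities,
         lambda r: replace(r, local_pref=200)),
]
```

Because a route learned over iBGP is never handed to another iBGP peer, `from_ibgp` contains only the eBGP peers, which is why a full iBGP mesh is needed for every router to learn every route.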
=== Extensions negotiation ===
During the peering handshake, when OPEN messages are exchanged, BGP speakers can negotiate optional capabilities of the session, including multiprotocol extensions and various recovery modes. If the multiprotocol extensions to BGP are negotiated at the time of creation, the BGP speaker can prefix the Network Layer Reachability Information (NLRI) it advertises with an address family prefix. These families include IPv4 (default), IPv6, IPv4/IPv6 Virtual Private Networks and multicast BGP. Increasingly, BGP is used as a generalized signaling protocol to carry information about routes that may not be part of the global Internet, such as VPNs.

In order to make decisions in its operations with peers, a BGP peer uses a simple finite-state machine (FSM) that consists of six states: Idle, Connect, Active, OpenSent, OpenConfirm, and Established. For each peer-to-peer session, a BGP implementation maintains a state variable that tracks which of these six states the session is in. BGP defines the messages that each peer should exchange in order to change the session from one state to another.

The first state is the Idle state. In the Idle state, BGP initializes all resources, refuses all inbound BGP connection attempts and initiates a TCP connection to the peer. The second state is Connect. In the Connect state, the router waits for the TCP connection to complete and transitions to the OpenSent state if successful. If unsuccessful, it starts the ConnectRetry timer and transitions to the Active state upon expiration. In the Active state, the router resets the ConnectRetry timer to zero and returns to the Connect state. In the OpenSent state, the router sends an Open message and waits for one in return in order to transition to the OpenConfirm state. Keepalive messages are exchanged and, upon successful receipt, the router is placed into the Established state. In the Established state, the router can send and receive Keepalive, Update, and Notification messages to and from its peer.

• Idle State:
  • Refuses all incoming BGP connections.
  • Starts the initialization of event triggers.
  • Initiates a TCP connection with its configured BGP peer.
  • Listens for a TCP connection from its peer.
  • Changes its state to Connect.
  • If an error occurs at any state of the FSM process, the BGP session is terminated immediately and returned to the Idle state. Some of the reasons why a router does not progress from the Idle state are:
    • TCP port 179 is not open.
    • A random TCP port over 1023 is not open.
    • Peer address configured incorrectly on either router.
    • AS number configured incorrectly on either router.
• Connect State:
  • Waits for successful TCP negotiation with peer.
  • BGP does not spend much time in this state if the TCP session has been successfully established.
  • Sends Open message to peer and changes state to OpenSent.
  • If an error occurs, BGP moves to the Active state. Some reasons for the error are:
    • TCP port 179 is not open.
    • A random TCP port over 1023 is not open.
    • Peer address configured incorrectly on either router.
    • AS number configured incorrectly on either router.
• Active State:
  • If the router was unable to establish a successful TCP session, then it ends up in the Active state.
  • The BGP FSM tries to restart another TCP session with the peer and, if successful, sends an Open message to the peer.
  • If it is unsuccessful again, the FSM is reset to the Idle state.
  • Repeated failures may result in a router cycling between the Idle and Active states. Some of the reasons for this include:
    • TCP port 179 is not open.
    • A random TCP port over 1023 is not open.
    • BGP configuration error.
    • Network congestion.
    • Flapping network interface.
• OpenSent State:
  • The BGP FSM listens for an Open message from its peer.
  • Once the message has been received, the router checks the validity of the Open message.
  • If there is an error, it is because one of the fields in the Open message does not match between the peers, e.g., a BGP version mismatch or the peering router expecting a different My AS. The router then sends a Notification message to the peer indicating why the error occurred.
  • If there is no error, a Keepalive message is sent, various timers are set and the state is changed to OpenConfirm.
• OpenConfirm State:
  • The router listens for a Keepalive message from its peer.
  • If a Keepalive message is received and no timer has expired before reception of the Keepalive, BGP transitions to the Established state.
  • If a timer expires before a Keepalive message is received, or if an error condition occurs, the router transitions back to the Idle state.
• Established State:
  • In this state, the peers send Update messages to exchange information about each route being advertised to the BGP peer.
  • If there is any error in the Update message, then a Notification message is sent to the peer, and BGP transitions back to the Idle state.
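The state list above can be condensed into a small transition table. The following Python sketch follows the simplified bullet-list description (a failure in Connect moves to Active, a second failure returns to Idle, and any error returns to Idle); the event names are illustrative labels chosen for this example, not the event numbers defined by the BGP standard, and timers are left out entirely.

```python
from enum import Enum, auto


class State(Enum):
    IDLE = auto()
    CONNECT = auto()
    ACTIVE = auto()
    OPEN_SENT = auto()
    OPEN_CONFIRM = auto()
    ESTABLISHED = auto()


# (current state, event) -> next state. Event names are hypothetical.
TRANSITIONS = {
    (State.IDLE, "start"): State.CONNECT,
    (State.CONNECT, "tcp_established"): State.OPEN_SENT,
    (State.CONNECT, "tcp_failed"): State.ACTIVE,
    (State.ACTIVE, "tcp_established"): State.OPEN_SENT,
    (State.ACTIVE, "tcp_failed"): State.IDLE,
    (State.OPEN_SENT, "open_received"): State.OPEN_CONFIRM,
    (State.OPEN_CONFIRM, "keepalive_received"): State.ESTABLISHED,
    (State.ESTABLISHED, "keepalive_received"): State.ESTABLISHED,
    (State.ESTABLISHED, "update_received"): State.ESTABLISHED,
}


def next_state(state, event):
    """Return the next state; any unlisted pair counts as an error -> Idle."""
    return TRANSITIONS.get((state, event), State.IDLE)


def run(events, state=State.IDLE):
    """Feed a sequence of events through the FSM and return the final state."""
    for event in events:
        state = next_state(state, event)
    return state


happy_path = run(["start", "tcp_established", "open_received",
                  "keepalive_received"])
flapping = run(["start", "tcp_failed", "tcp_failed"])
```

Treating every unlisted (state, event) pair as an error keeps the table short and mirrors the rule that an error in any state terminates the session and returns it to Idle.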
=== Router connectivity and learning routes ===
In the simplest arrangement, all routers within a single AS and participating in BGP routing must be configured in a full mesh: each router must be configured as a peer to every other router. This causes scaling problems, since the number of required connections grows quadratically with the number of routers involved (n routers need n(n − 1)/2 sessions). To alleviate the problem, BGP implements two options: route reflectors (RFC 4456) and BGP confederations (RFC 5065). The following discussion of basic update processing assumes a full iBGP mesh.

A given BGP router may accept
network-layer reachability information (NLRI) updates from multiple neighbors and advertise NLRI to the same, or a different set, of neighbors. The BGP process maintains several routing information bases:
• RIB: the router's main routing information base table.
• Loc-RIB: the local routing information base. BGP maintains its own master routing table, separate from the main routing table of the router.
• Adj-RIB-In: For each neighbor, the BGP process maintains a conceptual adjacent routing information base, incoming, containing the NLRI received from the neighbor.
• Adj-RIB-Out: For each neighbor, the BGP process maintains a conceptual adjacent routing information base, outgoing, containing the NLRI sent to the neighbor.

The physical storage and structure of these conceptual tables are decided by the implementer of the BGP code. Their structure is not visible to other BGP routers, although they usually can be interrogated with management commands on the local router. It is quite common, for example, to store the Adj-RIB-In, Adj-RIB-Out and the Loc-RIB together in the same data structure, with additional information attached to the RIB entries. The additional information tells the BGP process such things as whether individual entries belong in the Adj-RIBs for specific neighbors, whether the per-neighbor route selection process made received routes eligible for the Loc-RIB, and whether Loc-RIB entries are eligible to be submitted to the local router's routing table management process.

BGP submits the routes that it considers best to the main routing table process. Depending on the implementation of that process, the BGP route is not necessarily selected. For example, a directly connected prefix, learned from the router's own hardware, is usually most preferred. As long as that directly connected route's interface is active, the BGP route to the destination will not be put into the routing table. Once the interface goes down, and there are no more preferred routes, the Loc-RIB route would be installed in the main routing table.

BGP carries the information with which rules inside BGP-speaking routers can make policy decisions. Some of the information carried that is explicitly intended to be used in policy decisions is:
• Communities
• multi-exit discriminators (MED)
• autonomous systems (AS)
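The interaction between the Loc-RIB and the main routing table described in this section, where a directly connected route hides the BGP route until its interface goes down, can be illustrated with a small Python sketch. The class `MainRib`, its methods and the numeric preference values are assumptions made for this example; actual routers use their own per-source preference scales.

```python
# Lower value wins. The numbers are illustrative, not taken from any vendor.
SOURCE_PREFERENCE = {"connected": 0, "bgp": 20}


class MainRib:
    """Toy main routing table that keeps one candidate per route source."""

    def __init__(self):
        self.candidates = {}  # prefix -> {source: next hop}

    def offer(self, prefix, source, next_hop):
        """A routing process (e.g. BGP from its Loc-RIB) submits a route."""
        self.candidates.setdefault(prefix, {})[source] = next_hop

    def retract(self, prefix, source):
        """A routing process withdraws its route, e.g. an interface went down."""
        self.candidates.get(prefix, {}).pop(source, None)

    def active(self, prefix):
        """Return (source, next hop) of the installed route, or None."""
        options = self.candidates.get(prefix)
        if not options:
            return None
        source = min(options, key=SOURCE_PREFERENCE.__getitem__)
        return source, options[source]


rib = MainRib()
rib.offer("10.32.37.0/26", "connected", "ge-0/0/1.0")
rib.offer("10.32.37.0/26", "bgp", "192.0.2.1")
while_interface_up = rib.active("10.32.37.0/26")

rib.retract("10.32.37.0/26", "connected")  # the interface goes down
after_interface_down = rib.active("10.32.37.0/26")
```

BGP's best route stays in the candidate set the whole time; it only becomes the installed route once no more-preferred source remains.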
=== Route selection process ===
The BGP standard specifies a number of decision factors, more than the ones that are used by any other common routing process, for selecting NLRI to go into the Loc-RIB. The first decision point for evaluating NLRI is that its next-hop attribute must be reachable (or resolvable). Another way of saying the next-hop must be reachable is that there must be an active route, already in the main routing table of the router, to the prefix in which the next-hop address is reachable.

Next, for each neighbor, the BGP process applies various standard and implementation-dependent criteria to decide which routes conceptually should go into the Adj-RIB-In. The neighbor could send several possible routes to a destination, but the first level of preference is at the neighbor level. Only one route to each destination will be installed in the conceptual Adj-RIB-In. This process will also delete, from the Adj-RIB-In, any routes that are withdrawn by the neighbor.

Whenever a conceptual Adj-RIB-In changes, the main BGP process decides if any of the neighbor's new routes are preferred to routes already in the Loc-RIB. If so, it replaces them. If a given route is withdrawn by a neighbor, and there is no other route to that destination, the route is removed from the Loc-RIB and no longer sent by BGP to the main routing table manager. If the router does not have a route to that destination from any non-BGP source, the withdrawn route will be removed from the main routing table.

Candidate routes are compared one criterion at a time: as long as there is a tie, the route selection process moves to the next step.

The local preference, weight, and other criteria can be manipulated by local configuration and software capabilities. Such manipulation, although commonly used, is outside the scope of the standard. For example, the community attribute (see below) is not directly used by the BGP selection process. The BGP neighbor process can, however, apply a manually configured rule that sets local preference or another factor when the community value matches some pattern-matching criterion. If the route was learned from an external peer, the per-neighbor BGP process computes a local preference value from local policy rules and then compares the local preference of all routes from the neighbor.
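The tie-breaking behaviour of the selection process maps naturally onto tuple comparison. The sketch below is a deliberately simplified assumption, not the full decision process: it filters on next-hop reachability and then compares only local preference (higher wins), AS path length (shorter wins) and MED (lower wins), with the neighbor name as an arbitrary final tie-breaker. It also compares MED across all neighbors, which real implementations normally do only for routes from the same neighboring AS. The names `Candidate`, `preference_key` and `select_best` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    neighbor: str
    next_hop_reachable: bool
    local_pref: int
    as_path_len: int
    med: int


def preference_key(c):
    # Compared left to right: a later field only matters when every earlier
    # field ties. Local preference is negated because higher is better.
    return (-c.local_pref, c.as_path_len, c.med, c.neighbor)


def select_best(candidates):
    """Pick the route for the Loc-RIB, or None if no next hop is reachable."""
    usable = [c for c in candidates if c.next_hop_reachable]
    return min(usable, key=preference_key) if usable else None


a = Candidate("a", True, local_pref=100, as_path_len=3, med=0)
b = Candidate("b", True, local_pref=200, as_path_len=5, med=10)
c = Candidate("c", False, local_pref=300, as_path_len=1, med=0)
d = Candidate("d", True, local_pref=200, as_path_len=4, med=50)

best_overall = select_best([a, b, c, d])
```

Here `c` is excluded because its next hop is unreachable, `b` and `d` tie on local preference, and `d` wins on the shorter AS path even though its MED is higher.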
=== Communities ===
BGP communities are attribute tags that can be applied to incoming or outgoing prefixes to achieve some common goal. While it is common to say that BGP allows an administrator to set policies on how prefixes are handled by ISPs, this is generally not possible, strictly speaking. For instance, BGP natively has no concept to allow one AS to tell another AS to restrict advertisement of a prefix to only North American peering customers. Instead, an ISP generally publishes a list of well-known or proprietary communities with a description for each one, which essentially becomes an agreement of how prefixes are to be treated. Examples of common communities include:
• local preference adjustments
• geographic restrictions
• peer type restrictions
• denial-of-service attack identification
• AS prepending options

An ISP might, for example, publish communities such as the following for routes received from customers:
• To Customers North America (East Coast) 3491:100
• To Customers North America (West Coast) 3491:200

The customer simply adjusts their configuration to include the correct community or communities for each route, and the ISP is responsible for controlling who the prefix is advertised to. The end user has no technical ability to enforce correct actions being taken by the ISP, though problems in this area are generally rare and accidental.

It is a common tactic for end customers to use BGP communities (usually ASN:70, ASN:80, ASN:90 or ASN:100) to control the local preference the ISP assigns to advertised routes instead of using MED (the effect is similar). The community attribute is transitive, but communities applied by the customer very rarely propagate outside the next-hop AS. Not all ISPs give out their communities to the public.
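A standard community such as 3491:100 is a 32-bit value conventionally written as two 16-bit halves, and the local-preference tactic mentioned above amounts to a lookup on the second half. The Python sketch below illustrates both. The function names are hypothetical, AS 64500 is a documentation-range AS number used as a stand-in for the ISP, and the rule that the community value is used directly as the local preference is an assumption; each ISP publishes its own mapping.

```python
def parse_community(text):
    """Convert 'ASN:value' into the 32-bit community number."""
    asn, value = (int(part) for part in text.split(":"))
    if not (0 <= asn <= 0xFFFF and 0 <= value <= 0xFFFF):
        raise ValueError(f"both halves must fit in 16 bits: {text!r}")
    return (asn << 16) | value


def format_community(number):
    """Convert a 32-bit community number back into 'ASN:value' form."""
    return f"{number >> 16}:{number & 0xFFFF}"


def local_pref_for(communities, isp_asn=64500, default=100):
    """Hypothetical ISP-side policy: ASN:70/80/90/100 sets local preference."""
    for text in communities:
        asn, value = (int(part) for part in text.split(":"))
        if asn == isp_asn and value in (70, 80, 90, 100):
            return value
    return default


east_coast = parse_community("3491:100")
round_trip = format_community(east_coast)
backup_link_pref = local_pref_for(["3491:100", "64500:80"])
```

Communities that belong to a different AS, like 3491:100 in the last call, are simply ignored by this policy, which matches the observation that customer-applied communities rarely have any effect beyond the next-hop AS.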
=== BGP Extended Community Attribute ===
The BGP Extended Community Attribute was added in 2006, in order to extend the range of such attributes and to provide a community attribute structuring by means of a type field. The extended format consists of one or two octets for the type field followed by seven or six octets for the respective community attribute content. The IANA administers the registry for BGP Extended Communities Types. The Extended Communities Attribute itself is a transitive optional BGP attribute. A bit in the 'Type' field within the attribute decides whether the encoded extended community is of a transitive or non-transitive nature. The IANA registry therefore provides different number ranges for the attribute types.

Due to the extended attribute range, its usage can be manifold. RFC 4360 defines, as examples, the "Two-Octet AS Specific Extended Community", the "IPv4 Address Specific Extended Community", the "Opaque Extended Community", the "Route Target Community", and the "Route Origin Community". A number of BGP QoS drafts also use this Extended Community Attribute structure for inter-domain QoS signalling.

With the introduction of 32-bit AS numbers, some issues were immediately obvious with the community attribute, which only defines a 16-bit ASN field and therefore prevents matching between this field and the real ASN value. Since 2014, extended communities are compatible with 32-bit ASNs. To accommodate 32-bit AS numbers in BGP Communities, a Large Community attribute of 12 bytes was defined, divided into three fields of 4 bytes each (AS:function:parameter).
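The two layouts described above can be made concrete with Python's `struct` module. This is a sketch under stated assumptions: the function names are hypothetical, the extended-community example covers only the Two-Octet AS Specific form (a two-octet type field followed by a two-octet AS number and a four-octet local value), and the transitivity flag is taken to be bit 0x40 of the first type octet, with 0 meaning transitive, as laid out in RFC 4360.

```python
import struct


def encode_large_community(asn, function, parameter):
    """12 bytes: three unsigned 32-bit fields in network byte order."""
    return struct.pack("!III", asn, function, parameter)


def decode_large_community(data):
    """Return the (AS, function, parameter) triple from 12 bytes."""
    return struct.unpack("!III", data)


def encode_two_octet_as_extended(subtype, asn, local_value, transitive=True):
    """8 bytes: type octet, sub-type octet, 16-bit AS, 32-bit local value."""
    type_high = 0x00 if transitive else 0x40
    return struct.pack("!BBHI", type_high, subtype, asn, local_value)


def is_transitive(extended):
    """Check the transitivity bit in the first type octet."""
    return not (extended[0] & 0x40)


# A 32-bit AS number fits in a large community but not in a 16-bit ASN field.
large = encode_large_community(4200000000, 1, 2)

# Sub-type 0x02 is the Route Target sub-type for this community form.
route_target = encode_two_octet_as_extended(0x02, 64500, 100)
```

`struct.pack` raises `struct.error` if a value does not fit its field, so trying to place a 32-bit AS number in the 16-bit AS field of the extended form fails loudly rather than silently truncating.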
=== Multi-exit discriminators ===
MEDs, defined in the main BGP standard, were originally intended to show to another neighbor AS the advertising AS's preference as to which of several links is preferred for inbound traffic. Another application of MEDs is to advertise the value, typically based on delay, that multiple ASs with a presence at an IXP impose to send traffic to some destination.

Some routers (for example, those from Juniper) use the metric from OSPF to set the MED.
Examples of MED used with BGP when exported to BGP on Juniper SRX

• run show ospf route

    Topology default Route Table:

    Prefix             Path   Route     NH     Metric    NextHop      Nexthop
                       Type   Type      Type             Interface    Address/LSP
    10.32.37.0/24      Inter  Discard   IP     16777215
    10.32.37.0/26      Intra  Network   IP     101       ge-0/0/1.0   10.32.37.241
    10.32.37.64/26     Intra  Network   IP     102       ge-0/0/1.0   10.32.37.241
    10.32.37.128/26    Intra  Network   IP     101       ge-0/0/1.0   10.32.37.241

• show route advertising-protocol bgp 10.32.94.169

      Prefix             Nexthop   MED        Lclpref   AS path
    • 10.32.37.0/24      Self      16777215             I
    • 10.32.37.0/26      Self      101                  I
    • 10.32.37.64/26     Self      102                  I
    • 10.32.37.128/26    Self      101                  I

== Packet format ==