Message passing layers are permutation-equivariant layers mapping a graph into an updated representation of the same graph. Formally, they can be expressed as message passing neural networks (MPNNs). Let G = (V,E) be a
graph, where V is the node set and E is the edge set. Let N_u be the
neighbourhood of some node u \in V. Additionally, let \mathbf{x}_u be the
features of node u \in V, and \mathbf{e}_{uv} be the features of edge (u, v) \in E. An MPNN
layer can be expressed as follows:

:\mathbf{h}_u = \phi \left( \mathbf{x}_u, \bigoplus_{v \in N_u} \psi(\mathbf{x}_u, \mathbf{x}_v, \mathbf{e}_{uv}) \right)

where \phi and \psi are
differentiable functions (e.g.,
artificial neural networks), and \bigoplus is a
permutation invariant aggregation operator that can accept an arbitrary number of inputs (e.g., element-wise sum, mean, or max). In particular, \phi and \psi are referred to as
update and
message functions, respectively. Intuitively, in an MPNN computational block, graph nodes
update their representations by
aggregating the
messages received from their neighbours. The outputs of one or more MPNN layers are node representations \mathbf{h}_u for each node u \in V in the graph. Node representations can be employed for any downstream task, such as node/graph classification or edge prediction.
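To make the formalism concrete, the following is a minimal sketch of a single MPNN layer in NumPy. The message function \psi and update function \phi are stand-ins (single-layer perceptrons with random, untrained weights), the aggregation \bigoplus is an element-wise sum, and all function names and dimensions here are hypothetical illustrations rather than a reference implementation.

<syntaxhighlight lang="python">
import numpy as np

def mpnn_layer(X, E, edge_feats, W_psi, W_phi):
    """One MPNN layer: h_u = phi(x_u, sum_{v in N_u} psi(x_u, x_v, e_uv)).

    X          : (n, d)  node feature matrix
    E          : list of directed edges (u, v) with v a neighbour of u
    edge_feats : dict mapping (u, v) -> (d_e,) edge feature vector
    W_psi      : weights of the (stand-in) message function psi
    W_phi      : weights of the (stand-in) update function phi
    """
    n, d = X.shape
    d_out = W_psi.shape[0]
    agg = np.zeros((n, d_out))                      # aggregated messages per node
    for (u, v) in E:
        m_in = np.concatenate([X[u], X[v], edge_feats[(u, v)]])
        agg[u] += np.tanh(W_psi @ m_in)             # psi: one-layer MLP; sum-aggregate
    # phi: one-layer MLP applied to [x_u || aggregated messages]
    H = np.tanh(W_phi @ np.concatenate([X, agg], axis=1).T).T
    return H

# Toy usage: a triangle graph with 2-dimensional node and 1-dimensional edge features.
rng = np.random.default_rng(0)
X = rng.normal(size=(3, 2))
E = [(0, 1), (1, 0), (1, 2), (2, 1), (0, 2), (2, 0)]
edge_feats = {e: rng.normal(size=1) for e in E}
W_psi = rng.normal(size=(4, 2 + 2 + 1))   # psi: R^(d_u + d_v + d_e) -> R^4
W_phi = rng.normal(size=(4, 2 + 4))       # phi: R^(d_u + 4) -> R^4
H = mpnn_layer(X, E, edge_feats, W_psi, W_phi)  # (3, 4) updated node representations
</syntaxhighlight>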
Graph nodes in an MPNN update their representations by aggregating information from their immediate neighbours. As such, stacking n MPNN layers means that one node will be able to communicate with nodes that are at most n "hops" away. In principle, to ensure that every node receives information from every other node, one would need to stack a number of MPNN layers equal to the graph diameter. However, stacking many MPNN layers may cause issues such as oversmoothing and oversquashing. Oversmoothing refers to the issue of node representations becoming indistinguishable. Oversquashing refers to the bottleneck created by squeezing long-range dependencies into fixed-size representations. Countermeasures such as skip connections (as in residual neural networks), gated update rules, and jumping knowledge can mitigate oversmoothing. Modifying the final layer to be a fully-adjacent layer, i.e., treating the graph as a complete graph, can mitigate oversquashing in problems where long-range dependencies are required. Other "flavours" of MPNN have been developed in the literature, such as graph convolutional networks and graph attention networks, whose definitions can be expressed in terms of the MPNN formalism.
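As a rough illustration of the fully-adjacent countermeasure mentioned above, one can stack layers on the original edge set and swap in the complete graph's edge set for the last layer only. The sketch below reuses the hypothetical mpnn_layer from the previous snippet; the per-layer weights in params are assumed to be shape-compatible, and edge features are replaced by a constant zero for brevity.

<syntaxhighlight lang="python">
import numpy as np

def mpnn_stack(X, E, params, n_nodes):
    """Stack MPNN layers; only the last layer is fully adjacent."""
    complete = [(u, v) for u in range(n_nodes) for v in range(n_nodes) if u != v]
    H = X
    for i, (W_psi, W_phi) in enumerate(params):
        edges = complete if i == len(params) - 1 else E  # fully-adjacent last layer
        zero_e = {e: np.zeros(1) for e in edges}         # constant dummy edge feature
        H = mpnn_layer(H, edges, zero_e, W_psi, W_phi)
    return H
</syntaxhighlight>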
== Graph convolutional network ==
The graph convolutional network (GCN) was first introduced by
Thomas Kipf and
Max Welling in 2017. A GCN layer defines a
first-order approximation of a localized spectral
filter on graphs. GCNs can be understood as a generalization of
convolutional neural networks to graph-structured data. The formal expression of a GCN layer reads as follows:

:\mathbf{H} = \sigma\left(\tilde{\mathbf{D}}^{-\frac{1}{2}} \tilde{\mathbf{A}} \tilde{\mathbf{D}}^{-\frac{1}{2}} \mathbf{X} \mathbf{\Theta}\right)

where \mathbf{H} is the matrix of node representations \mathbf{h}_u, \mathbf{X} is the matrix of node features \mathbf{x}_u, \sigma(\cdot) is an
activation function (e.g.,
ReLU), \tilde{\mathbf{A}} is the graph
adjacency matrix with the addition of self-loops, \tilde{\mathbf{D}} is the graph
degree matrix with the addition of self-loops, and \mathbf{\Theta} is a matrix of trainable parameters. In particular, let \mathbf{A} be the graph adjacency matrix: then, one can define \tilde{\mathbf{A}} = \mathbf{A} + \mathbf{I} and \tilde{\mathbf{D}}_{ii} = \sum_{j \in V} \tilde{A}_{ij}, where \mathbf{I} denotes the
identity matrix. This normalization ensures that the
eigenvalues of \tilde{\mathbf{D}}^{-\frac{1}{2}} \tilde{\mathbf{A}} \tilde{\mathbf{D}}^{-\frac{1}{2}} are bounded in the range [-1, 1], avoiding
numerical instabilities and
exploding/vanishing gradients. A limitation of GCNs is that they do not allow multidimensional edge features \mathbf{e}_{uv}. It is however possible to associate scalar weights w_{uv} to each edge by imposing A_{uv} = w_{uv}, i.e., by setting each nonzero entry in the adjacency matrix equal to the weight of the corresponding edge.
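The GCN propagation rule amounts to a few lines of linear algebra. The sketch below, with hypothetical dimensions and untrained weights, builds \tilde{\mathbf{A}} and \tilde{\mathbf{D}} from a (possibly weighted) adjacency matrix and applies the layer; using a weighted adjacency corresponds to the scalar edge weights w_{uv} just described.

<syntaxhighlight lang="python">
import numpy as np

def gcn_layer(A, X, Theta):
    """GCN layer: H = ReLU(D~^(-1/2) A~ D~^(-1/2) X Theta).

    A     : (n, n) adjacency matrix (entries may be scalar edge weights w_uv)
    X     : (n, d) node feature matrix
    Theta : (d, d_out) trainable weight matrix
    """
    A_tilde = A + np.eye(A.shape[0])              # add self-loops: A~ = A + I
    d_tilde = A_tilde.sum(axis=1)                 # degrees with self-loops
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d_tilde))  # D~^(-1/2)
    A_hat = D_inv_sqrt @ A_tilde @ D_inv_sqrt     # symmetrically normalized adjacency
    return np.maximum(A_hat @ X @ Theta, 0.0)     # sigma = ReLU

# Toy usage on a 3-node path graph with 2-dimensional features.
A = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
rng = np.random.default_rng(0)
X = rng.normal(size=(3, 2))
Theta = rng.normal(size=(2, 4))
H = gcn_layer(A, X, Theta)   # (3, 4) node representations
</syntaxhighlight>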
== Graph attention network ==
The graph attention network (GAT) was introduced by
Petar Veličković et al. in 2018. A graph attention network is a combination of a GNN and an attention layer. The attention layer lets the network weight the information coming from each neighbour by its relevance, instead of treating all neighbours uniformly. A multi-head GAT layer can be expressed as follows:

:\mathbf{h}_u = \overset{K}{\underset{k=1}{\Big\Vert}} \sigma \left(\sum_{v \in N_u} \alpha_{uv} \mathbf{W}^k \mathbf{x}_v\right)

where K is the number of
attention heads, \Big\Vert denotes
vector concatenation, \sigma(\cdot) is an
activation function (e.g.,
ReLU), \alpha_{uv} are attention coefficients, and \mathbf{W}^k is a matrix of trainable parameters for the k-th attention head. For the final GAT layer, the outputs from each attention head are averaged before the application of the activation function. Formally, the final GAT layer can be written as:

:\mathbf{h}_u = \sigma \left(\frac{1}{K}\sum_{k=1}^K \sum_{v \in N_u} \alpha_{uv} \mathbf{W}^k \mathbf{x}_v\right)
Attention in machine learning is a technique that mimics
cognitive attention. In the context of learning on graphs, the attention coefficient \alpha_{uv} measures
how important node v \in V is to node u \in V. Normalized attention coefficients are computed as follows:

:\alpha_{uv} = \frac{\exp\left(\text{LeakyReLU}\left(\mathbf{a}^T [\mathbf{W} \mathbf{x}_u \Vert \mathbf{W} \mathbf{x}_v \Vert \mathbf{e}_{uv}]\right)\right)}{\sum_{z \in N_u}\exp\left(\text{LeakyReLU}\left(\mathbf{a}^T [\mathbf{W} \mathbf{x}_u \Vert \mathbf{W} \mathbf{x}_z \Vert \mathbf{e}_{uz}]\right)\right)}

where \mathbf{a} is a vector of learnable weights, \cdot^T indicates
transposition, \mathbf{e}_{uv} are the edge features (if present), and \text{LeakyReLU} is a
modified ReLU activation function. Attention coefficients are normalized to make them easily comparable across different nodes. A GCN can be seen as a special case of a GAT where attention coefficients are not learnable, but fixed and equal to the edge weights w_{uv}.
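The sketch below implements one multi-head GAT layer (edge features omitted) over a dense adjacency matrix, with hypothetical, untrained parameters: per-head attention logits use the LeakyReLU scoring function above, the softmax is restricted to each node's neighbours by masking non-edges, and head outputs are concatenated (or averaged, for a final layer).

<syntaxhighlight lang="python">
import numpy as np

def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

def gat_layer(A, X, Ws, atts, concat=True):
    """Multi-head GAT layer (edge features omitted for brevity).

    A      : (n, n) binary adjacency matrix; A[u, v] = 1 iff v in N_u
    X      : (n, d) node features
    Ws     : list of K per-head weight matrices W^k, each (d_out, d)
    atts   : list of K per-head attention vectors a, each (2 * d_out,)
    concat : concatenate heads (hidden layers) or average them (final layer)
    """
    heads = []
    for W, a in zip(Ws, atts):
        d_out = W.shape[0]
        Z = X @ W.T                                   # projected features W x
        # Logit e_uv = LeakyReLU(a^T [W x_u || W x_v]), split into two halves.
        e = leaky_relu((Z @ a[:d_out])[:, None] + (Z @ a[d_out:])[None, :])
        e = np.where(A > 0, e, -np.inf)               # restrict softmax to N_u
        alpha = np.exp(e - e.max(axis=1, keepdims=True))
        alpha /= alpha.sum(axis=1, keepdims=True)     # normalized coefficients
        heads.append(alpha @ Z)                       # sum_v alpha_uv W^k x_v
    if concat:  # hidden layers: apply sigma per head, then concatenate
        return np.concatenate([np.maximum(h, 0.0) for h in heads], axis=1)
    return np.maximum(np.mean(heads, axis=0), 0.0)    # final layer: average first

# Toy usage: triangle graph, 2 heads. With fixed, non-learnable alpha equal to
# the edge weights, this reduces to the GCN special case noted above.
rng = np.random.default_rng(0)
A = np.ones((3, 3)) - np.eye(3)
X = rng.normal(size=(3, 2))
Ws = [rng.normal(size=(4, 2)) for _ in range(2)]
atts = [rng.normal(size=8) for _ in range(2)]
H = gat_layer(A, X, Ws, atts)   # (3, 8) after concatenating 2 heads of size 4
</syntaxhighlight>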
== Gated graph sequence neural network ==
The gated graph sequence neural network (GGS-NN) was introduced by
Yujia Li et al. in 2015. The GGS-NN extends the GNN formulation by Scarselli et al. to output sequences. The message passing framework is implemented as an update rule to a
gated recurrent unit (GRU) cell. A GGS-NN can be expressed as follows:

:\mathbf{h}_u^{(0)} = \mathbf{x}_u \, \Vert \, \mathbf{0}

:\mathbf{m}_u^{(l+1)} = \sum_{v \in N_u} \mathbf{\Theta} \mathbf{h}_v^{(l)}

:\mathbf{h}_u^{(l+1)} = \text{GRU}(\mathbf{m}_u^{(l+1)}, \mathbf{h}_u^{(l)})

where \Vert denotes
vector concatenation, \mathbf{0} is a vector of zeros, \mathbf{\Theta} is a matrix of learnable parameters, \text{GRU} is a GRU cell, and l denotes the sequence index. In a GGS-NN, the node representations are regarded as the hidden states of a GRU cell. The initial node features \mathbf{x}_u are zero-padded up to the hidden state dimension of the GRU cell. The same GRU cell is used for updating representations for each node.
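A sketch of the GGS-NN recurrence follows, with a hand-rolled GRU cell in NumPy and hypothetical, untrained parameters: node features are zero-padded to the hidden dimension, messages are linear in the neighbours' hidden states, and a single shared GRU cell updates every node.

<syntaxhighlight lang="python">
import numpy as np

def gru_cell(m, h, Wz, Uz, Wr, Ur, Wh, Uh):
    """Standard GRU update with input m and hidden state h (biases omitted)."""
    z = 1.0 / (1.0 + np.exp(-(m @ Wz + h @ Uz)))  # update gate
    r = 1.0 / (1.0 + np.exp(-(m @ Wr + h @ Ur)))  # reset gate
    h_cand = np.tanh(m @ Wh + (r * h) @ Uh)       # candidate state
    return (1.0 - z) * h + z * h_cand

def ggsnn(A, X, Theta, gru_params, n_layers):
    """GGS-NN message passing.

    A     : (n, n) adjacency matrix; A[u, v] = 1 iff v in N_u
    X     : (n, d) node features, with d <= hidden size d_h
    Theta : (d_h, d_h) message weight matrix
    """
    n, d = X.shape
    d_h = Theta.shape[0]
    H = np.concatenate([X, np.zeros((n, d_h - d))], axis=1)  # h^(0) = x || 0
    for _ in range(n_layers):
        M = A @ H @ Theta.T              # m_u = sum_{v in N_u} Theta h_v
        H = gru_cell(M, H, *gru_params)  # one shared GRU cell updates every node
    return H

# Toy usage: triangle graph, hidden size 4, 2-dimensional input features.
rng = np.random.default_rng(0)
A = np.ones((3, 3)) - np.eye(3)
X = rng.normal(size=(3, 2))
Theta = rng.normal(size=(4, 4))
gru_params = [rng.normal(size=(4, 4)) * 0.1 for _ in range(6)]
H = ggsnn(A, X, Theta, gru_params, n_layers=3)   # (3, 4) node representations
</syntaxhighlight>

== Local pooling layers ==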