What is Graph Attention Network?

Source: towardsdatascience.com

A data structure consisting of nodes is called a graph, which for simplicity is represented as points on a two-dimensional plane or in three-dimensional space, and edges connecting the nodes. You may have heard of this technology but didn’t know what it is.

Graphs can be directed or undirected, have cycles, be multigraphs, etc. We won’t dwell on all of this to save time, but feel free to take a closer look at it.

The most important thing to know about graphs is that they are widely used. Using graphs, different spheres and industries, represent the connections or relations between some essences. For example, molecules can be represented as graphs, social networks, roads in the city, and even images can be decomposed as graphs if we will show the connections of the nearest pixels.

We have to highlight the important point about graphs: their flexibility and complicated structures. For example, different nodes may differ significantly in the number of neighbors they have. This makes it difficult to use graphs as input for neural networks, which do a pretty good job with tabular or at least some fixed structured data.

Source: frontiersin.org

Check serokell.io/blog  to familiarize yourself with deep learning and NNs methodologies.

Let’s move on to the topic of this article – machine learning on graphs.

Imagine that each node has its feature vector, some label, and we are trying to predict this label based only on the graph topology and feature vectors. Sometimes we perform regression, classification, or parameter tuning on the graph, but with all of that, we are trying to predict its label.

But before studying graph machine learning, let’s discuss the methods that were used before. These methods are usually difficult to compute, they require their own Laplace base. This problem was later solved by Michael Defferard. However, since the correct base of the Laplacian strongly depends on the graph structure, it is difficult to apply the trained model to a different graph.

Source: towardsdatascience.com

Spatial methods work and define convolutions directly on the chart. In principle, these methods can be as follows:

  1. Create a representation of feature vectors. This can be done by multiplying them with some weight matrix.
  2. Combine representations over a subset of nodes, which are different for every individual node.
  3. The feature vector of each node is updated using the combination from the previous stage.

In GraphSAGE, node representations are updated by aggregating feature vectors of a sampled fixed size neighborhood of each node. This combination can be, for example, just an average value.

We missed, on purpose, some important things, so we recommend reading more on the topic if you want to go deeper.

Let’s summarize everything we have at the moment:

  1. We have some graphs, each node has a feature vector, and our goal is to predict the labels of each node.
  2. We can use some complex mathematical approaches that are difficult to calculate or that cannot be retranslated to graphs with a different structure.
  3. We can generalize the convolution and apply it directly to graphs.
  4. However, our generalization is limited. We know how to create vector representations of objects for each node and to update them using aggregation over some neighborhoods, but we can’t find the optimal aggregation function yet.
Source: github.com

Graphic Attention Network

An attention-based architecture is used to classify data nodes structured by the graph. Nowadays, attention patterns are everywhere, but that wasn’t quite the case in 2018. The best thing about attention-based models is that they can work with variable-sized input data and focus on the most significant parts.

The main idea is to allow attention layers to study coefficients for aggregating neighbors. Thus, we update the representation of each node with a weighted sum of neighbor representations, and attention is responsible for assigning these weights, which allows us to pay more attention to more important nodes.

This algorithm can be tested on several different data sets:

  • The Cora dataset consists of 2,708 publications assigned to one of seven classes. In the citation network, there are 5429 references. The dictionary consists of 1433 unique words.
  • The CiteSeer dataset consists of 3312 publications assigned to one of six classes. In the citation network, there are 4732 references. The dictionary consists of 3703 unique words.
  • The PubMed dataset consists of around 19000 scientific publications from the PubMed database related to diabetes and is assigned to one of three classes. The citation network consists of 44338 references. Each publication in the dataset is described by a weighted TF/IDF vector of words from a dictionary, consisting of 500 unique words.
  • The protein-protein interaction dataset consists of 94359 edges, 20 training graphs, 2 validation graphs, and 2 test graphs. There are only 121 classes in PPI, and multiple classes can be associated with each node, so this is a data set for classification with multiple labels. Each node has 50 features, it is a combination of sets of positional genes, sets of motivating genes, and immunological signatures.

The test shows that GAT outperforms the graph convolutional network by 2% with almost the same RMS value. A comparison of GAT with the best result of the graph convolutional network computing 64 latent features. The GAT model is 20.5% better than the state-of-the-art inductive tuning previously achieved by GraphSAGE.

Source: github.com


Programming can not exist without graph-theory methods and algorithms. The applicability of graphs is wide because they are a natural means of explaining complex situations on an intuitive level. These advantages of representing complex structures and processes with graphs become even more tangible when good visualization tools are available.

Graph tools, like all others dealing with structured data, need to preserve and communicate graphs and data associated with them.

The graphic attention network, using a disguised self-control mechanism, significantly surpassed the most modern models of that time.

The advantages of using an attention-based architecture are as follows:

  • Computational efficiency. There are no expensive matrix operations here, and the algorithm is parallelized across all nodes of the graph.
  • In addition, knowledge of the entire graph structure is not required, and it is possible to deal with neighborhoods of different sizes.

It is an impressive technology that might change how we see and use graphs. And not only in the programming world but in our everyday lives.