Much of the information around us can be represented as graphs. One example is city road networks, in which intersections are the nodes and roads are the links. Another is regulatory networks, which describe how different genes interact to enhance or suppress certain cellular functions. Graph neural networks (GNNs) can process graph-based information to make predictions. In this issue, Xue et al. use GNNs to study and predict socioeconomic indicators on the basis of the topology of road networks from 30 cities across the globe.


GNNs emerged over a decade ago as a way to tackle complex data that are best represented as graphs, with many links and dependencies between objects. Convolutional neural networks (CNNs) excel at processing data such as images, text or video, which can be thought of as simple graphs or sequences of fixed size and shape. But much of the data around us takes the more complex form of graphs of irregular size and shape, such as social networks, road networks, genomic data or molecular interactions. CNNs struggle to process data of this form, which is where GNNs come in: they learn to encode information about the local surroundings of each node in a graph, and can then perform prediction or classification tasks at the level of the whole graph, or for each node or edge.
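The core idea of encoding each node's local surroundings can be illustrated with a minimal sketch, assuming a small undirected graph given as an adjacency matrix. Each node simply averages the feature vectors of its neighbourhood and passes the result through a learned transformation (here a toy identity weight matrix); real GNN layers stack several such rounds:

```python
import numpy as np

def message_passing_layer(A, H, W):
    """One round of neighbourhood aggregation: each node averages the
    feature vectors of its neighbours (and itself), then applies a
    learned linear map W followed by a ReLU nonlinearity."""
    A_hat = A + np.eye(A.shape[0])          # add self-loops
    deg = A_hat.sum(axis=1, keepdims=True)  # neighbourhood sizes
    H_agg = (A_hat @ H) / deg               # mean over each neighbourhood
    return np.maximum(0, H_agg @ W)         # linear map + ReLU

# Toy graph: three nodes in a path 0-1-2, with 2-dimensional features
A = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
H = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
W = np.eye(2)  # identity weights, purely for illustration
H1 = message_passing_layer(A, H, W)  # one updated embedding per node
```

Graph-level predictions are then typically obtained by pooling the node embeddings (for example, summing or averaging them), while node- or edge-level tasks read out the embeddings directly.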

Graph-based deep learning has found success in many areas, from recommender systems to traffic time predictions. But GNNs have also proven useful in scientific applications such as genomics, molecular design, drug development and physics simulations. A promising genomics example can be found in a recent paper by Schulte-Sasse et al. Over the past two decades, high-throughput sequencing techniques, in combination with epigenetic and transcriptomic molecular data, have provided a wealth of information for genomics, some of which can be mined to identify genes that cause cancer. The picture of how cancer genes drive cell growth has become more complex in recent years: cancer can arise via several routes beyond the mutation of a gene at the DNA sequence level, and epigenetic mechanisms or noncoding mutations in regulatory regions that indirectly activate or silence other genes can also have a role. Multi-omics datasets can shed light on such processes. The data can be modelled as biological networks, or graphs, in which nodes represent genes and links represent gene–gene interactions. The authors use an advanced type of GNN — a graph convolutional network — that can classify unlabelled nodes in a network on the basis of both the node feature vectors and the network’s topology. Their method, which combines multidimensional multi-omics node features with topological features of a protein–protein interaction network, can identify not only highly mutated cancer genes but also genes that contain other kinds of alteration, or genes that interact with other cancer genes. Applying the method to genomic data from 16 cancer types, the authors identified 165 possible new candidate cancer genes.

The machine learning method used by Schulte-Sasse et al. — semi-supervised classification with graph convolutional networks — was introduced in a seminal paper by Kipf and Welling in 2017, and has stimulated many advances in graph-based machine learning over the past five years. A News & Views article in this issue by Haghir Chehreghani highlights this paper and discusses the efficiency, explainability and scalability of the approach.
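The layer proposed by Kipf and Welling is compact enough to state directly. As a rough numpy sketch (not the authors' implementation, which uses sparse matrices within a deep learning framework), the propagation rule normalizes the self-looped adjacency matrix symmetrically by node degree before mixing node features:

```python
import numpy as np

def gcn_layer(A, H, W):
    """Kipf-Welling graph convolution:
    H' = ReLU(D^{-1/2} (A + I) D^{-1/2} H W),
    where D is the degree matrix of A + I."""
    A_hat = A + np.eye(A.shape[0])              # adjacency with self-loops
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
    A_norm = A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return np.maximum(0, A_norm @ H @ W)        # normalize, mix, transform

# Two connected nodes with one-hot features and identity weights:
A = np.array([[0., 1.],
              [1., 0.]])
out = gcn_layer(A, np.eye(2), np.eye(2))
# symmetric normalization mixes the two nodes' features equally
```

For semi-supervised node classification, a couple of such layers are stacked and trained with a cross-entropy loss on the few labelled nodes; labels then propagate to unlabelled nodes through the shared graph structure.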

Molecular property prediction and drug discovery form another area for graph-based approaches. The field has used machine learning for several decades in various creative ways, linked to different methods for representing molecules. One approach is to represent molecules as linear strings of characters (known as the ‘simplified molecular-input line-entry system’, or SMILES), which lends itself to deep learning methods from natural language processing. Alternatively, to preserve the topological information of molecules, they can be represented as graphs, with atoms as nodes and bonds as links. A recent paper by Wang et al. describes a self-supervised GNN-based method that tackles the relative scarcity of labelled data for molecules compared with the enormous space of potentially bioactive molecules (thought to be of the order of 10⁶⁰). The authors’ framework is trained on a large unlabelled dataset of around 10 million molecules, and learns molecular representations that can be used to distinguish chemical properties.
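The contrast between the two representations can be made concrete with a toy example, hand-built here for ethanol rather than parsed with a cheminformatics library. The SMILES form is a flat string, whereas the graph form keeps the bond topology explicit in exactly the shape that a GNN's message passing operates on:

```python
# Ethanol in the two representations discussed above.
smiles = "CCO"  # linear, string-based representation

# Graph representation: atoms as nodes, bonds as undirected links
atoms = ["C", "C", "O"]   # node labels
bonds = [(0, 1), (1, 2)]  # C-C and C-O single bonds

# Adjacency list: for each atom, the indices of its bonded neighbours
adjacency = {i: [] for i in range(len(atoms))}
for u, v in bonds:
    adjacency[u].append(v)
    adjacency[v].append(u)

print(adjacency)  # {0: [1], 1: [0, 2], 2: [1]}
```

In practice, node features would also encode atom type, charge and so on, and edge features would encode bond order; the point here is only that the graph form preserves neighbourhood structure that a linear string leaves implicit.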

Another component that can be added to graph representations of molecules is their geometric structure, given that different isomers of molecules with the same topology can have different chemical properties. A recent paper by Fang et al. describes how encoding both the topological and geometric structure of molecules, in a geometry-enhanced GNN approach, can improve molecular property predictions.

As highlighted by Haghir Chehreghani, there is still a need to develop a better theoretical framework for GNNs, and to better understand their limitations. Given the growing amount of complex, highly connected data, a more foundational understanding will stimulate further exciting applications of GNNs in many areas.