Forecasting the future of artificial intelligence with machine learning-based link prediction in an exponentially growing knowledge network

A tool that could suggest new personalized research directions and ideas by taking insights from the scientific literature could profoundly accelerate the progress of science. A field that might benefit from such an approach is artificial intelligence (AI) research, where the number of scientific publications has been growing exponentially over recent years, making it challenging for human researchers to keep track of the progress. Here we use AI techniques to predict the future research directions of AI itself. We introduce a graph-based benchmark based on real-world data—the Science4Cast benchmark, which aims to predict the future state of an evolving semantic network of AI. For that, we use more than 143,000 research papers and build up a knowledge network with more than 64,000 concept nodes. We then present ten diverse methods to tackle this task, ranging from pure statistical to pure learning methods. Surprisingly, the most powerful methods use a carefully curated set of network features, rather than an end-to-end AI approach. These results indicate a great potential that can be unleashed for purely ML approaches without human knowledge. Ultimately, better predictions of new future research directions will be a crucial component of more advanced research suggestion tools. The number of publications in artificial intelligence (AI) has been increasing exponentially and staying on top of progress in the field is a challenging task. Krenn and colleagues model the evolution of the growing AI literature as a semantic network and use it to benchmark several machine learning methods that can predict promising research directions in AI.


I. INTRODUCTION AND MOTIVATION
The corpus of scientific literature grows at an ever-increasing speed. Specifically, in the field of Artificial Intelligence (AI) and Machine Learning (ML), the number of papers every month grows exponentially with a doubling time of roughly 23 months (see Fig. 1). Simultaneously, the AI community is embracing diverse ideas from many disciplines such as mathematics, statistics, and physics, making it challenging to organize different ideas and uncover new scientific connections. We envision a computer program that can automatically read, comprehend and act on AI literature. It can predict and suggest meaningful research ideas that transcend individual knowledge and cross-domain boundaries. If successful, it could significantly improve the productivity of AI researchers, open up new avenues of research, and help drive progress in the field.
Here, we address this important and challenging vision. New research ideas often result from drawing novel connections between seemingly unrelated concepts [1][2][3]. Therefore, we formulate the evolution of AI literature as a temporal network modelling task.
* mario.krenn@mpl.mpg.de
We created an evolving semantic network characterizing the content and evolution of the scientific literature in the field of AI since 1994. The network contains about 64,000 nodes (each representing a concept used in an AI paper) and 18 million edges that connect two concepts when they were investigated jointly in a scientific paper. We use the semantic network as an input to 10 diverse statistical and machine-learning methods to predict the future evolution of the semantic network with high accuracy. That is, we can predict which combinations of concepts AI researchers will investigate in the future. Being able to predict what scientists will work on is a first crucial step for suggesting new topics that might have a high impact.
Several of the methods presented in this paper have been contributions to the Science4Cast competition hosted by IEEE BigData 2021, which ran from August to November 2021. Broadly, we can divide the methods into two classes: methods that use hand-crafted network-theoretical features, and those that automatically learn features. We found that models using carefully hand-crafted features outperform methods that attempt to learn features autonomously. This (somewhat surprising) finding indicates a great potential for improvements of models free of human priors.
[Fig. 1: The doubling time of papers per month is roughly 23 months, which might lead to problems for publishing in these fields at some point. The categories are cs.AI, cs.LG, cs.NE, and stat.ML.]
Our manuscript has several purposes. First, we introduce a new meaningful benchmark for AI on real-world graphs. Second, we provide nearly 10 diverse methods that solve this benchmark. Third, we explain how solving this task could become an essential ingredient for the big-picture goal of having a tool that could suggest meaningful research directions for scientists in AI or in other disciplines (footnote 1).
The manuscript is structured in the following way. We first introduce more background into semantic networks and how they can help to suggest new ideas. Then we explain how we generate the dataset and some of its network-theoretical properties. Then we briefly explain the 10 methods that we have investigated to solve the task. We conclude with a number of important open questions that could bring us further toward the goal of AI-based suggestions for research directions.

II. SEMANTIC NETWORKS
The goal here is to extract knowledge from the scientific literature that can subsequently be processed by computer algorithms. At first glance, a natural first step would be to use the features of a large language model (such as GPT3 [4], Gopher [5], Megatron [6] or PaLM [7]) from the text of each article to extract concepts automatically. However, those methods still struggle in reasoning capabilities [8,9], thus it is not yet directly clear how these models can be used for identifying and suggesting new ideas and concept combinations.
Footnote 1: github.com/artificial-scientist-lab/FutureOfAIviaAI
An alternative approach has been pioneered by Rzhetsky and colleagues [10]. They have shown how knowledge networks (or semantic networks) in biochemistry can be created from co-occurring concepts in scientific papers. The nodes in their network correspond to scientific concepts, concretely the names of individual biomolecules. The nodes are linked when a paper mentions both of the corresponding biomolecules in its title or abstract. Taking millions of papers into account leads to an evolving semantic network that captures the history of the field. Using supercomputer simulations, nontrivial statements about the collective behaviour of scientists can be extracted, which allows for the suggestion of alternative and more efficient research behaviour [11]. Of course, by creating a semantic network from concept co-occurrences, only a tiny amount of knowledge is extracted from each paper. However, if this process is repeated for a large dataset of papers, the resulting network captures nontrivial and actionable content.
The idea to build up a semantic network of a scientific discipline was then applied and extended in the field of quantum physics [12]. There, the authors (including one of us) built a network of more than 6,000 quantum physics concepts. The authors formulated the task of predicting new research trends and connections for the first time as an ML task. The task was to identify which concept pairs that have never been discussed jointly in the scientific literature have a high probability of being investigated in the near future. This prediction task was phrased as one component for personalized suggestions of new research ideas.

A. Link Prediction in Semantic Networks
Here we formulate the predictions of future research topics as a link prediction task in an exponentially growing semantic network in the field of AI. Two nodes that do not share an edge have not been mentioned together in the title or abstract of an existing scientific paper. Here, the goal is to predict which unconnected nodes will be connected in the future, that is, determine which scientific concepts that have not been researched together yet will be jointly researched in the future.
[Fig. 2: From arXiv to Science4Cast. We use 143,000 papers in AI and ML categories on arXiv from 1992 to 2020. From there, we construct a list of concepts (using RAKE and other NLP tools). Those concepts form the nodes of a semantic network. The edges are drawn when two concepts occur jointly in the title or abstract of a paper. In that way, we generate an evolving semantic network that grows over time as more concepts are investigated together. The task is to predict, from unconnected nodes (i.e., concepts that have not been investigated together in the scientific literature), which will be connected within a few years. In this manuscript, we present 10 diverse statistical and machine learning methods to solve this challenge.]
Link prediction is a very common problem in computer science that can be solved with classical metrics and features as well as machine learning techniques. From the network theory side, several works have studied local motif-based methods [13][14][15][16][17], often based on path-counting, while other methods have studied more global features using linear optimization [18], global perturbations [19] and stochastic block models [20]. Other machine-learning works have tried to optimize over a combination of hundreds of predictors [21]. Further discussion on these methods is available in a recent review on link prediction [22].
In [12], this task was solved by computing 17 hand-crafted features of the evolving semantic network. In the Science4Cast competition, the goal was to find more precise methods for link-prediction tasks in semantic networks (a semantic network of AI that is 10 times larger than the one in [12]). Specifically, on the one hand, we would like to determine which features are useful; on the other hand, we would also like to know whether this task can be solved efficiently without hand-crafted features. Here, we present results for both questions.

B. Potential for Idea Generation in Science
The long-term goal of predictions and suggestions in semantic networks is to provide new ideas to individual researchers. In a way, we hope to build a creative artificial muse in science [23]. We can bias or constrain the model to give research topics that are related to the research interest of individual scientists, or a pair of scientists to suggest topics for collaborations in an interdisciplinary setting. Important future questions concern the discovery of impactful and surprising suggestions, and suggestions that give more context than two scientific concepts.

A. Dataset Construction
We use papers that are published on arXiv in the categories cs.AI, cs.LG, cs.NE, and stat.ML, from 1992 to 2020, to create a dynamic semantic network. The nodes stand for computer science and in particular artificial intelligence concepts. We create the list of concepts from the titles and abstracts of all of the 143,000 papers. We use Rapid Automatic Keyword Extraction (RAKE) to create candidate concepts [24], and normalize the list using standard NLP techniques and other self-created methods. Ultimately, this leads to a list of 64,719 concepts.
These concepts form the nodes of the semantic network. The edges are drawn when two concepts co-appear in a title or abstract of a paper. Each edge has a time stamp, which is the publication date of the paper in which the concepts co-appear. Multiple edges with different time stamps between two concepts are very common, as concept pairs can co-appear in many papers with different publication dates. As edges have time stamps, the entire semantic network is evolving in time. The workflow is depicted in Fig. 2.
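The edge construction described here can be sketched in a few lines. This is a minimal illustration and not the authors' pipeline: it assumes each paper has already been reduced to a publication date and a list of extracted concepts, and the function names `build_semantic_network` and `snapshot`, as well as the toy papers, are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def build_semantic_network(papers):
    """Map each unordered concept pair to the sorted list of its time stamps.

    `papers` is an iterable of (date, concepts) tuples, where `date` is an
    ISO string (so string comparison orders dates correctly).
    """
    edges = defaultdict(list)
    for date, concepts in papers:
        # Every pair of distinct concepts co-appearing in one paper gets an edge.
        for u, v in combinations(sorted(set(concepts)), 2):
            edges[(u, v)].append(date)
    return {pair: sorted(stamps) for pair, stamps in edges.items()}


def snapshot(edges, cutoff):
    """Unique undirected edges whose earliest time stamp is before `cutoff`."""
    return {pair for pair, stamps in edges.items() if stamps[0] < cutoff}


# Toy input (invented for illustration only).
papers = [
    ("2015-03-01", ["neural network", "image processing"]),
    ("2017-06-12", ["neural network", "image processing", "transfer learning"]),
]
edges = build_semantic_network(papers)
early = snapshot(edges, "2016-01-01")
```

Keeping the full list of time stamps per pair preserves the multi-edges mentioned above, while `snapshot` recovers the state of the network at any earlier date.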

B. Network-Theoretical Analysis
We start by analyzing the degree distribution of the published semantic network. The network has 64,719 nodes and 17,892,352 unique undirected edges, which implies a mean node degree of about 553. However, the network contains many hub nodes that significantly exceed this mean degree, demonstrated by the heavy-tail degree distribution in Fig. 3. For example, the ten highest node degrees (and their corresponding concepts) are 466,319 (neural network), 198,050 (deep learning), 195,345 (machine learning), 169,555 (convolutional neural network), 159,403 (real world), 150,227 (experimental result), 127,642 (deep neural network), 115,334 (large scale), 89,267 (high dimension), and 84,956 (high dimensional).
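The quoted mean degree can be checked with the handshake identity for an undirected graph, in which every unique edge contributes to the degree of two nodes. Here $E$ and $N$ are shorthand introduced for this check (number of unique edges and number of nodes):

```latex
\langle k \rangle = \frac{2E}{N} = \frac{2 \times 17{,}892{,}352}{64{,}719} \approx 552.9
```

The hub degrees listed above are far larger than the $N - 1$ unique neighbours a node could have, so they presumably count repeated, time-stamped edges rather than unique neighbours.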
To investigate whether this complex network is scale-free, we fit a power-law curve to the degree distribution $p(k)$ using [25], and the software fit $p(k) \propto k^{-2.28}$ for degree $k \geq 1672$. Nevertheless, the degree distributions of real complex networks do not always follow perfect power laws, and power laws with exponential cut-offs are often a better fit than pure power laws [26].
A recent work [27] empirically showed that lognormal distributions fit most real-world networks as well as or better than power laws, and confirmed that pure "scale-free networks are rare". In light of that result, we used likelihood ratio tests to compare the power law fit with alternative distributions. The likelihood ratio tests from [25] suggested that truncated power law (p-value: 0.0031), lognormal (p-value: 0.0045), and lognormal positive (p-value: 0.015) fit the data better than power law, while exponential (p-value: 3e-10) and stretched exponential (p-value: 6e-05) were worse. We could not conclude whether truncated power law, lognormal, or lognormal positive best describe the data with p-value ≤ 0.1. Next, we discuss changes in the network connectivity over time.
While the degree distributions maintained a heavy tail over the years, the ordering of the nodes inside the heavy tail changed, likely in response to the popularity trends in the field. The nodes with most connections (and the year they became so) are decision tree (1994), machine learning (1996), logic program (2000), neural network (2005), experimental result (2011), machine learning (2013), and finally, back to neural network (2015).
Furthermore, the network grew more connected over time according to connected component analysis in Fig. 4. Groups that were previously separated became connected, i.e., the number of connected components decreased, while the largest group grew bigger. The trajectory of the mid-sized connected components may reveal interesting trends about their topics. Take image processing for instance. A connected component of the following 4 nodes appeared in 1999: brightness change, planar curve, local feature, and differential invariant. In 2000, 3 more nodes joined the group: similarity transformation, template matching, and invariant representation. Then in 2006, a paper that discusses both support vector machine and local feature merged this mid-sized group of nodes into the largest connected component.
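A connected-component analysis of this kind can be sketched with a union-find structure applied to yearly snapshots. The code below is an illustrative sketch under stated assumptions, not the analysis code behind Fig. 4: the helper names are hypothetical, edges dated up to and including the given year are used, and the toy edge list is invented (only the concept names are borrowed from the example above, with a single stand-in edge playing the role of the largest component).

```python
def component_sizes(nodes, edges):
    """Sizes of the connected components, largest first, via union-find."""
    parent = {n: n for n in nodes}

    def find(x):
        # Walk up to the root, shortening the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for u, v in edges:
        root_u, root_v = find(u), find(v)
        if root_u != root_v:
            parent[root_u] = root_v

    sizes = {}
    for n in nodes:
        root = find(n)
        sizes[root] = sizes.get(root, 0) + 1
    return sorted(sizes.values(), reverse=True)


def sizes_up_to(timed_edges, year):
    """Component sizes using only edges dated no later than `year`."""
    edges = [(u, v) for y, u, v in timed_edges if y <= year]
    nodes = {n for edge in edges for n in edge}
    return component_sizes(nodes, edges)


# Invented toy edges; the last one merges the small group into the other component.
timed_edges = [
    (1999, "brightness change", "planar curve"),
    (1999, "planar curve", "local feature"),
    (1999, "local feature", "differential invariant"),
    (1999, "support vector machine", "neural network"),
    (2006, "support vector machine", "local feature"),
]
sizes_2000 = sizes_up_to(timed_edges, 2000)
sizes_2006 = sizes_up_to(timed_edges, 2006)
```

Comparing the size lists across years shows both effects described above: the number of components drops and the largest one grows.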
Another trend that emerges from the semantic network is an increase in centralization over time, with a smaller percentage of nodes (concepts) contributing a larger fraction of edges (concept combinations) over the years. This trend seems to be consistent across the entire period of the dataset. It can be seen from the histogram in Fig. 5 that the fraction of edges corresponding to the highest-degree nodes (most connected) increases over the years, while the fraction of edges corresponding to the least connected nodes decreases. This trend is also consistent with the decrease in the average clustering coefficient over time (average clustering coefficient by year: 1999: 0.919, 2004: 0.844, 2009: 0.773, 2013: 0.650), implying most nodes are less likely to be connected with each other and more likely to be connected to a few high-degree central nodes. This trend might be explained by the fact that the AI community has been focusing on a few methods (e.g., deep learning) which have grown to dominate the field, compared to more diverse approaches in the 90s and 2000s. An alternative explanation is the use of more consistent terminology.
[Fig. 5: ... (1999, 2004, 2009, 2014). The graph was created by going over the edge list and adding to each year only the edges and nodes that are dated before the year (hence the 2014 plot contains all the nodes (concepts) in papers before 2014). The nodes are arranged by increasing degree. The plot is a cumulative graph; hence the y value at x = 80 is the fraction of edges contributed by all the nodes at or below the 80th percentile of degrees.]

C. Problem Formulation: Predictions in an exponentially growing semantic network
The concrete task is to predict which two nodes $v_i$ with degrees $d(v_i) \leq c$ that do not share an edge in the year $(2021-\delta)$ will have at least $w$ edges in the year 2021. We use $\delta = 1, 3, 5$, $c = 0, 5, 25$ and $w = 1, 3$. Note that $c = 0$ is an interesting special case in which the node does not have any edge associated to it yet in the initial year. Thus, the model does not have any information about the node yet; the task there is to predict which nodes will be connected by entirely new edges. The task $w = 3$ goes beyond simple link prediction, and asks which uninvestigated concept pair will be studied together in at least 3 papers. In the task, we provide a list of 10 million unconnected node pairs (each node having a degree $\leq c$) of the year $(2021-\delta)$, and the goal is to sort this list from highest to lowest probability that in 2021 they will have at least $w$ edges.
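The labelling step of this task can be sketched as follows. This is a minimal illustration, not the benchmark code: it assumes the candidate pairs are already known to be unconnected at the cutoff year, it leaves out the degree filter on the nodes, and the function name `label_pairs` and the toy edges are hypothetical.

```python
from collections import Counter


def label_pairs(candidate_pairs, timed_edges, cutoff_year, final_year, w):
    """Label each candidate pair 1 if it gains at least `w` edges after
    `cutoff_year` and up to `final_year`, else 0.

    `timed_edges` is a list of (year, u, v); edge direction is ignored.
    """
    new_edges = Counter(
        frozenset((u, v))
        for year, u, v in timed_edges
        if cutoff_year < year <= final_year
    )
    return {pair: int(new_edges[frozenset(pair)] >= w) for pair in candidate_pairs}


# Invented toy data: cutoff 2018 corresponds to delta = 3 with final year 2021.
timed_edges = [
    (2019, "a", "b"),
    (2020, "a", "b"),
    (2021, "b", "a"),
    (2020, "a", "c"),
]
candidates = [("a", "b"), ("a", "c"), ("b", "c")]
labels_w1 = label_pairs(candidates, timed_edges, 2018, 2021, w=1)
labels_w3 = label_pairs(candidates, timed_edges, 2018, 2021, w=3)
```

With $w = 1$ any new co-occurrence counts as a positive, whereas $w = 3$ keeps only pairs that are studied together repeatedly.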
For the evaluation we use the ROC curve [28]; see Fig. 6 for details. The ROC curve is created by plotting the true positive rate against the false positive rate at various threshold settings. Our evaluation metric is the commonly used Area under the Curve (AUC) of the ROC curve. One advantage of AUC over mean-square-error is its independence of the data distribution. Specifically, in our case, where the two classes are highly asymmetrically distributed (with only about 1-3% of newly connected edges) and the distribution changes over time, the AUC provides a meaningful interpretation: it gives the probability that a random true element is ranked higher than a random false one. For perfect predictions, AUC = 1, while random predictions give AUC = 0.5.
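The ranking interpretation of the AUC can be made concrete with a direct pairwise computation. The sketch below is for illustration only: the function name `auc_from_scores` is hypothetical, ties are counted as one half (a common convention), and the quadratic loop would be far too slow for 10 million pairs, where a rank-based formula would be used instead.

```python
def auc_from_scores(scores, labels):
    """AUC as the fraction of (positive, negative) pairs in which the
    positive example receives the higher score; ties count one half."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(
        (p > n) + 0.5 * (p == n)
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


# Toy scores: one positive is ranked below one negative.
auc_mixed = auc_from_scores([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0])
auc_perfect = auc_from_scores([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0])
auc_constant = auc_from_scores([0.5, 0.5, 0.5, 0.5], [1, 0, 1, 0])
```

Because only the ordering of the scores matters, the value does not change when the share of positive pairs shifts between years.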

IV. AI-BASED SOLUTIONS
We now demonstrate how to solve this task with numerous different methods, from pure statistical approaches to hand-crafted network features (NF) as the input of a neural network, to ML models that can work without hand-crafted features. All results are shown in Fig. 7. The most powerful methods (those with the highest prediction quality measured by the AUC metric) take advantage of NF, which are the inputs to an ML model. Surprisingly, using purely network-theoretical features without machine learning is competitive. Pure ML methods were not yet able to outperform those that use hand-crafted features. It remains an important open challenge how to solve this task without relying on hand-crafted features. While the prediction of new links can reach an AUC of up to 93%, we find that the prediction of links that are generated at least three times can be solved with AUC > 99.5%. Understanding this apparently quasi-deterministic pattern in AI research will be an interesting target for follow-up research.

A. M1: Features+ML
The solution of team oahciy is based on a blend of a tree-based gradient boosting approach and a graph neural network approach [29]. Extensive feature engineering was conducted to capture the centralities of the nodes, the proximity between node pairs, and their evolution over time. The centrality of a node is captured by the number of neighbours and the PageRank score [30], while the proximity between a node pair is derived using the Jaccard index. We refer the reader to [29] for the list of all features and their feature importance.
The tree-based gradient boosting approach uses the Light Gradient Boosting Machine (LightGBM) [31] and applies heavy regularization to combat overfitting due to the scarcity of positive samples. The graph neural network approach employs a time-aware graph neural network to learn node representations on dynamic semantic networks.

B. M2: Features+ML
The method proposed by Team HashBrown assumes that the probability that nodes $u$ and $v$ form an edge in the future is a function of the node features $f(u)$, $f(v)$, and some edge feature $h(u, v)$. We chose node features $f$ that capture popularity at the current time $t_0$ (such as degree, clustering coefficient [32,33], and PageRank [30]). We also use these features' first and second time-derivatives to capture the evolution of the node's popularity over time. After variable selection during training, we chose $h$ to consist of the HOP-rec score [34,35] and a variation of the Dice similarity score [36] as a measure of similarity between nodes. In summary, we use 31 node features for each node, and two edge features, which gives 31 × 2 + 2 = 64 features in total. These features are then fed into a small multilayer perceptron (MLP) (5 layers, each with 13 neurons) with ReLU activation.
Cold start is the problem that some nodes in the test set do not appear in the training set. Our strategy for a cold start is imputation. We say a node $v$ is seen if it appeared in the training data, and unseen otherwise; similarly, we say that a node is born at time $t$ if $t$ is the first time stamp where an edge linking this node has appeared. The idea is that an unseen node is simply a node born in the future, so its features should look like those of a recently born node in the training set. If a node is unseen, then we impute its features as the average of the features of the nodes born recently. We found that with imputation during training, the test AUC scores across all models consistently increased by about 0.02. For a complete description of this method, we refer the reader to [37].
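The imputation idea can be sketched as below. This is an illustrative sketch and not the implementation of [37]: the function name `impute_unseen`, the `window` parameter that defines "recently born", and the toy feature values are all assumptions made for the example.

```python
def impute_unseen(features, birth_year, unseen_nodes, current_year, window=1):
    """Give every unseen node the mean feature vector of recently born nodes.

    `features` maps node -> list of floats, `birth_year` maps node -> year of
    its first edge, and `window` (hypothetical) sets how recent a birth must be.
    """
    recent = [
        features[node]
        for node, born in birth_year.items()
        if node in features and current_year - born <= window
    ]
    if not recent:
        raise ValueError("no recently born nodes to average over")
    mean = [sum(column) / len(recent) for column in zip(*recent)]
    imputed = dict(features)
    for node in unseen_nodes:
        imputed[node] = list(mean)
    return imputed


# Invented toy features, e.g. [degree, clustering coefficient].
features = {"a": [10.0, 0.5], "b": [2.0, 0.1], "c": [4.0, 0.3]}
birth_year = {"a": 2000, "b": 2016, "c": 2017}
imputed = impute_unseen(features, birth_year, ["new concept"], current_year=2017)
```

In this toy case the long-established node "a" is excluded from the average, so the unseen node inherits the profile of the two young nodes only.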

C. M3: Features+ML
This approach, detailed in [38], uses hand-crafted node features that have been captured in multiple time snapshots (e.g., every year) and then uses an LSTM to benefit from learning the time dependencies of these features. The final configuration uses two main types of features: node features including degree and degree of neighbours, and edge features including common neighbours. To balance the training data, the same number of positive and negative instances were randomly sampled and combined.
One of the goals was to identify features that are very informative with a very low computational cost. We found that the degree centrality of the nodes is the most important feature, and the degree centrality of the neighbouring nodes and the degree of mutual neighbours gave us the best tradeoff. As the distributions of all extracted features are highly skewed to the right, meaning most of the features take near-zero values, using a power transform like Yeo-Johnson [39] helps to make the distributions more Gaussian, which boosts the learning. Finally, for the link prediction task, we saw that LSTMs perform better than fully connected neural networks.

D. M4: pure Features
The following two methods are based on a purely statistical analysis of the test data and are explained in detail in [40].
Preferential Attachment - In the network analysis we concluded that the growth of this dataset tends to maintain a heavy-tailed degree distribution, often associated with scale-free networks. As mentioned before, the γ-value of the degree distribution is very close to 2, suggesting that preferential attachment [41] is likely the main organizational principle of the network. As such, we implemented a simple prediction model following this procedure. Preferential-attachment scores in link prediction are often quantified as
$$s^{PA}_{ij} = k_i \cdot k_j,$$
with $k_{i,j}$ the degree of nodes $i$ and $j$. However, this assumes the scoring of links between nodes that are already connected to the network, that is $k_{i,j} > 0$, which is not the case for all the links we must score in the dataset. As a result, we define our preferential attachment model as
$$s^{PA}_{ij} = k_i + k_j.$$
Using this simple model with no free parameters we could score new links and compare them with the other models. Immediately we note that preferential attachment outperforms some learning-based models, even if it never manages to reach the top AUC, but it is extremely simple and has negligible computational cost.
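The difficulty with zero-degree nodes can be seen in a small sketch. It compares the usual product form of the preferential-attachment score, $k_i \cdot k_j$, with a sum form, $k_i + k_j$; the sum form is shown here as an assumption about one simple parameter-free variant that remains informative when a node has no edges yet, and the function names and toy graph are hypothetical.

```python
from collections import Counter


def degrees(edges):
    """Degree of every node appearing in a list of undirected edges."""
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return deg


def pa_product(deg, i, j):
    """Usual preferential-attachment score; zero if either node is new."""
    return deg.get(i, 0) * deg.get(j, 0)


def pa_sum(deg, i, j):
    """Sum variant (assumed here); still ranks pairs that involve a new node."""
    return deg.get(i, 0) + deg.get(j, 0)


# Invented toy graph; node "e" has no edges yet.
deg = degrees([("a", "b"), ("a", "c"), ("a", "d"), ("b", "c")])
product_scores = (pa_product(deg, "d", "e"), pa_product(deg, "b", "e"))
sum_scores = (pa_sum(deg, "d", "e"), pa_sum(deg, "b", "e"))
```

With the product form, every pair involving the new node "e" is tied at zero, whereas the sum form still prefers attaching "e" to the better-connected node "b".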
Common Neighbours - We explore another network-based approach to score the links. Indeed, while the preferential attachment model we derived performed well, it uses no information about the distance between $i$ and $j$, which is a popular feature used in link prediction methods [22]. As such we decided to test a method known as Common Neighbours [13]. If we define $\Gamma(i) \cap \Gamma(j)$ as the set of common neighbours between nodes $i$ and $j$, we can easily score the nodes with
$$s^{CN}_{ij} = |\Gamma(i) \cap \Gamma(j)|,$$
the intuition being that nodes which share a larger number of neighbours are more likely to be connected than distant nodes that do not share any. Evaluating this score for each pair $(i, j)$ on the dataset of unconnected pairs, which can be computed as the second power of the adjacency matrix, $A^2$, we obtained an AUC which is sometimes higher and sometimes lower than that of preferential attachment, but is still consistently quite close to that of the best learning-based models.
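A minimal sketch of the common-neighbours score, together with a check that it matches the corresponding entry of the squared adjacency matrix for a simple graph, is given below. The helper names and the toy graph are hypothetical, and neighbour sets are used instead of an explicit matrix to keep the example dependency-free.

```python
from collections import defaultdict


def neighbour_sets(edges):
    """Map each node to the set of its neighbours (undirected, unique edges)."""
    nbrs = defaultdict(set)
    for u, v in edges:
        nbrs[u].add(v)
        nbrs[v].add(u)
    return nbrs


def common_neighbours(nbrs, i, j):
    """Number of nodes adjacent to both i and j."""
    return len(nbrs.get(i, set()) & nbrs.get(j, set()))


def a_squared_entry(nbrs, i, j):
    """(A^2)_ij = sum_k A_ik A_kj for a 0/1 adjacency matrix A."""
    return sum(1 for k in nbrs.get(i, set()) if j in nbrs.get(k, set()))


# Invented toy graph: a square a-b-d-c-a without diagonals.
nbrs = neighbour_sets([("a", "b"), ("a", "c"), ("b", "d"), ("c", "d")])
score_ad = common_neighbours(nbrs, "a", "d")
score_ab = common_neighbours(nbrs, "a", "b")
```

In practice a sparse matrix product yields all pair scores at once, which is why the score is cheap even for millions of candidate pairs.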

E. M5: Features + ML
This method is based on [42] with a modification disclosed in Section VI C. First, 10 groups of first-order graph features are extracted to get some neighbourhood and similarity properties from each pair of nodes: degree centrality of nodes, pair's total number of neighbours, common neighbours index, Jaccard coefficient, Simpson coefficient, geometric coefficient, cosine coefficient, Adamic-Adar index, resource allocation index, and preferential attachment index. They are obtained for three consecutive years to capture the temporal dynamics of the semantic network, leading to a total of 33 features. Second, principal component analysis (PCA) [43] is applied to reduce the correlation between features, speed up the learning process and improve generalization, which results in a final set of 7 latent variables. Lastly, a random forest classifier is trained (using a balanced dataset) to estimate the likelihood of new links between the AI concepts.
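Several of these pairwise indices can be computed from neighbour sets alone, as the following sketch shows. The formulas are common textbook definitions and are assumptions here; they may differ in detail from those used in [42] (in particular the Simpson and geometric coefficients). The function name `pair_indices` and the toy graph are hypothetical.

```python
import math


def pair_indices(nbrs, i, j):
    """Neighbourhood-based similarity indices for the node pair (i, j).

    `nbrs` maps each node to the set of its neighbours.
    """
    ni, nj = nbrs[i], nbrs[j]
    common, union = ni & nj, ni | nj
    ki, kj = len(ni), len(nj)
    return {
        "common_neighbours": len(common),
        "total_neighbours": len(union),
        "jaccard": len(common) / len(union) if union else 0.0,
        "simpson": len(common) / min(ki, kj) if min(ki, kj) else 0.0,
        "geometric": len(common) ** 2 / (ki * kj) if ki * kj else 0.0,
        "cosine": len(common) / math.sqrt(ki * kj) if ki * kj else 0.0,
        # A common neighbour has degree >= 2, so the logarithm is positive.
        "adamic_adar": sum(1 / math.log(len(nbrs[z])) for z in common),
        "resource_allocation": sum(1 / len(nbrs[z]) for z in common),
        "preferential_attachment": ki * kj,
    }


# Invented toy graph given directly as neighbour sets.
nbrs = {
    "a": {"b", "c"},
    "b": {"a", "d"},
    "c": {"a", "d", "e"},
    "d": {"b", "c"},
    "e": {"c"},
}
indices_ad = pair_indices(nbrs, "a", "d")
indices_bc = pair_indices(nbrs, "b", "c")
```

Evaluating such a function on the network snapshots of three consecutive years and concatenating the results gives a feature vector of the kind described above.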

F. M6: Features+ML
The baseline solution for the Science4Cast competition was closely related to the model presented in [12]. It uses 15 hand-crafted features of a pair of nodes $v_1$ and $v_2$: the degrees of $v_1$ and $v_2$ in the current year and the previous two years (six properties); the number of shared neighbours in total of $v_1$ and $v_2$ in the current year and the previous two years (six properties); and the number of shared neighbours between $v_1$ and $v_2$ in the current year and the previous two years (three properties). These 15 features are the input of a neural network with four layers (15, 100, 10, and 1 neurons), intending to predict whether the nodes $v_1$ and $v_2$ will have $w$ edges in the future. After the training, the model computes the probability for all 10 million evaluation examples. This list is sorted and the AUC is computed.

G. M7: end-to-end ML (Transformers)
This model, which is detailed in [44], does not use any hand-crafted features but learns them in a completely unsupervised manner. To do so, we extract various snapshots of the adjacency matrix through time, capturing graphs in the form of $A_t$ for $t = 1994, \dots, 2019$. We then embed each of these graphs into 128-dimensional Euclidean space via Node2vec [45,46]. For each node $u$ in the semantic graph, we extract different 128-dimensional vector embeddings $n_u(A_{1994}), \dots, n_u(A_{2019})$.
Transformers have performed extremely well in natural language processing tasks [47], thus we apply them to learn the dynamics of the embedding vectors. We pre-train a transformer to help classify node pairs. For the transformer, the encoder and decoder had 6 layers each; we used 128 as the embedding dimension, 2048 as the feedforward dimension and 8-headed attention. This transformer acts as our feature extractor. Once we pre-train our transformer, we add a 2-layer ReLU network with hidden dimension 128 as a classifier on top.

H. M8: end-to-end ML (auto node embedding)
The most immediate way one can apply machine learning to this problem is by automating the detection of features. Quite simply, the baseline solution M6 is modified such that instead of 15 hand-crafted features, the neural network is instead trained on features extracted from a graph embedding. In our approach, we use the ProNE embedding [48], which is based on sparse matrix factorizations modulated by the higher-order Cheeger inequality [49], as well as Node2vec [45]. We use the implementations provided in the nodevectors Python package [50].
The embeddings learn a 32-dimensional representation for each node; hence, each edge representation is normalized to a single point in $[0, 1]^{64}$, and the concatenated features are the input of a neural network with two hidden layers of size 1000 and 30, respectively. Similarly to M6, the model is then tasked with computing the probability for the evaluation examples, which lets us determine the ROC.

V. EXTENSIONS AND FUTURE WORK
Creating an AI that can suggest research topics to human scientists is highly ambitious and challenging. The present work of link prediction for a temporal network to draw connections between existing concepts is only the first step. We point out several extensions and future works that are directly relevant to the overarching goal of AI for AI.
High-quality predictions without feature engineering - Surprisingly, given a graph with already extracted concepts as nodes and edges tracking the time evolution of the joint appearance of these concepts in publications, the most powerful methods all used carefully hand-crafted features. It will be interesting to see whether end-to-end deep learning methods can solve tasks without feature engineering.
Fully automated concept extraction - The concept list at the moment is created by a purely statistical text analysis using RAKE. The suggestions by RAKE are then manually inspected and phrases that do not correspond to a concept are removed. While this process can be partially automated (as RAKE often makes the same mistakes, which can be captured automatically), it is not a scalable process if one wants to create concept lists for the much larger corpus of science and engineering. A fully automated natural language processing algorithm that can extract meaningful concepts with minimal mistakes would be extremely useful.
Generation of new concepts - Here we predict the emergence of links between two known concepts. One important question is whether an AI algorithm can compose words and generate new concepts. Different from the current work that is mostly supervised, the generation of new concepts is unsupervised, hence more difficult. One approach to address this question has been presented in [51,52]. There the authors can detect clusters of concepts with specific dynamics that indicate the formation of a new concept. It will be interesting to see how such emerging concepts can be incorporated into the current framework and used for suggestions for new research topics.
Semantic information beyond concept pairs -At the moment, every article's abstract and title are compressed into several links between concept pairs.This procedure does not represent all information in the article's abstract (let alone, the article itself).The more information one can extract from the article, the more meaningful the predictions and suggestions will be.Extending the representation of the semantic network to more complex data structures, such as hypergraphs [53] are likely to be computationally more demanding but could significantly improve the prediction qualities.It might be also possible to find some ways to decrease the complexity of the analysis using clever tricks.For example, the authors in [54] showed that the maximum node and hyperedge cover problem, two computational NP-hard problems, can be solved in polynomial time for most of the real-world hypergraphs tested.Whether such tricks exist for hyperlink prediction is still an open problem.The inclusion of sociological factors, such as the status of the involved researchers and their affiliations might help in prediction tasks.
Predictions of scientific success - The prediction of a new link between nodes in the semantic network means that we predict which concepts scientists will study together for the first time in the future. This prediction, however, says nothing about the potential importance and impact of the new connection. For a tool that gives high-quality suggestions, we need to add the prediction of a metric of success, for example the estimated citation numbers of the new link or the rate of citation growth over time. This extension seems reasonable given that modelling and predicting citation information in citation networks (where nodes are papers) is a prominent area of research within the science of science [55,56]. Adapting these techniques to semantic networks will be an interesting future research direction.
Anomaly detection - Predicting the most likely new connection between concepts does not necessarily coincide with the goal of suggesting new, surprising research directions. After all, those links are predictable and thus potentially not surprising by themselves. While we believe that this type of prediction can still be a very useful contribution to suggestions, there is a more direct way to find surprising combinations, namely by finding anomalies in the semantic network. Those are potential links that have extreme properties in some metrics. There are powerful deep learning methods for anomaly detection [57,58], and their application to the semantic network presented here might be very interesting. In fact, while scientists tend to study topics in which they are already directly involved [2,3], higher scientific impact often results from the unexpected combination of more distant domains [10], which motivates the search for those surprising and impactful associations.
End-to-end formulation - As outlined above, we necessarily decomposed our goal of extracting knowledge from the scientific literature into two subtasks: extracting concepts, and building and predicting the evolution of a semantic network resulting from those concepts. This stands in contrast to the dominant paradigm in deep learning that emerged over the last decade, so-called 'end-to-end' training, which is based on early spectacular successes [59-62]. In this paradigm, problems are not broken into sub-problems but solved directly using deep differentiable architecture components trained via back-propagation [63,64]. If such an 'end-to-end' approach to our goal could be achieved, it would be interesting to see whether it could replicate the success this deep learning paradigm has had in other areas.
Human-level machine comprehension - One of the defining goals of the Dartmouth Summer Research Project on Artificial Intelligence in 1956 was the following: 'An attempt will be made to find how to make machines use language, form abstractions and concepts, solve kinds of problems now reserved for humans, and improve themselves.' [65]. Such an algorithm would be expected to handle an evolution in concept denotations due to new insights (e.g. the emergence of the term 'Gibbs entropy' to distinguish Boltzmann's original concept of thermodynamical entropy, as opposed to seeing it in the light of the more general emergent 'Shannon entropy' or 'von Neumann entropy') or due to disputed originality (e.g. Bolyai-Lobachevskian geometry and hyperbolic geometry are the same concept). An algorithm with such natural language understanding capabilities would thus be extremely useful for getting closer to our goal. Although large language models and other multimodally trained language models like CLIP [66] or CLOOB [67] have achieved outstanding results recently, it is an open question to what extent statistically trained natural language models alone could eventually form concepts and abstractions at a human level [68,69].

VI. CONCLUSION
Here we present a new AI benchmark for link prediction in exponentially growing semantic networks. Several of the solutions were collected in the IEEE BigData Competition Science4Cast in fall 2021 and generalized to the more diverse tasks presented here. The goal was to boost the capability of predicting future research directions in the field of AI itself, which has grown enormously over the past decade. This ability might be an important part of a tool that gives personalized research suggestions to human scientists in the future. We find, rather surprisingly, that strong new links (those that are formed three or more times) can be predicted with extremely high quality (AUC beyond 99%). It will be interesting to investigate this quasi-deterministic pattern in AI research in more detail. The best methods used a clever combination of hand-crafted features and machine learning. It will be interesting to see whether pure learning methods, without hand-crafted features, will achieve high-quality results in the future. We also point out a number of open problems on the way to practical, personalized, interdisciplinary AI-based suggestions for new impactful research directions, which we believe could become a disruptive tool in the future.

B. Details on M9
The solution M9 was not part of the Science4Cast competition and is therefore not described in the corresponding proceedings, so we add more details here. We compare the ProNE embedding to Node2Vec, which is also commonly used for graph embedding problems. The algorithm maps each node of the network to a point in 32-dimensional space based on a biased random walk procedure, which is fundamentally parameterized by two variables: p, the "return parameter", and q, the "in-out parameter". The return parameter determines the frequency of backtracking in the random walk, while the in-out parameter determines whether to bias the exploration toward nearby nodes or distant nodes. Notably, these parameters significantly affect how the network is encoded; for instance, in the BlogCatalog dataset, the optimal parameters were p = 0.25, q = 0.25, whereas for the Wikipedia graph, they were p = 4, q = 0.5 [45]. In initial experiments, we used the default p = q = 1 for a 64-dimensional encoding, before feeding it into the same neural network as for the ProNE experiment. The higher variance in Node2Vec-based predictions is likely due to the method's significant sensitivity to its hyperparameters. While ProNE is clearly better suited for a general multi-dataset link prediction problem, Node2Vec's parameter sensitivity may help us identify which features of the network are most important for predicting its temporal evolution.
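To make the roles of p and q concrete, the following is a minimal sketch of the second-order biased random walk that underlies Node2Vec. It is not the code used for M9: the function names, the dictionary-of-sets graph representation and the toy graph are assumptions made for illustration only. After a step from a previous node to the current node, a candidate next node receives the unnormalized weight 1/p if it is the previous node (backtracking), 1 if it is also a neighbor of the previous node, and 1/q otherwise.

```python
import random


def transition_weights(adj, prev, cur, p, q):
    """Unnormalized Node2Vec weights for leaving `cur` after a step prev -> cur.

    `adj` is a hypothetical dict mapping each node to the set of its neighbors.
    """
    weights = {}
    for nxt in adj[cur]:
        if nxt == prev:
            weights[nxt] = 1.0 / p  # return to the previous node
        elif nxt in adj[prev]:
            weights[nxt] = 1.0      # stay close: also a neighbor of prev
        else:
            weights[nxt] = 1.0 / q  # move outward, away from prev
    return weights


def node2vec_walk(adj, start, length, p=1.0, q=1.0, seed=0):
    """Sample one biased walk containing at most `length` nodes."""
    rng = random.Random(seed)
    walk = [start]
    while len(walk) < length:
        cur = walk[-1]
        if not adj[cur]:
            break  # isolated node: the walk cannot continue
        if len(walk) == 1:
            # The first step has no previous node, so it is uniform.
            walk.append(rng.choice(sorted(adj[cur])))
            continue
        weights = transition_weights(adj, walk[-2], cur, p, q)
        nodes = sorted(weights)
        walk.append(rng.choices(nodes, weights=[weights[n] for n in nodes], k=1)[0])
    return walk


# Toy undirected graph (illustrative only).
toy_graph = {0: {1, 2}, 1: {0, 2, 3}, 2: {0, 1}, 3: {1}}
example_walk = node2vec_walk(toy_graph, start=0, length=10, p=4.0, q=0.5)
```

In a full pipeline, many such walks would be fed to a skip-gram model to obtain the node embeddings; that step is omitted here. With p = q = 1 all weights are equal and the walk reduces to an unbiased random walk.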

C. Consideration for Model M6
In this manuscript, the method was modified relative to its original formulation [42]: two of the original features, average neighbor degree and clustering coefficient, were discarded because they were infeasible to extract for some of the tasks covered in this paper, as their computation is expensive for such a large network. In addition, due to memory constraints, it was not possible to run the model for some of the tasks covered in this study, so those results are missing.

Figure 1. The number of papers published per month in the arXiv categories of AI grows exponentially. The doubling time of papers per month is roughly 23 months, which might at some point lead to problems for publishing in these fields. The categories are cs.AI, cs.LG, cs.NE, and stat.ML.

Figure 2. From arXiv to Science4Cast. We use 143,000 papers in AI and ML categories on arXiv from 1992 to 2020. From there, we construct a list of concepts (using RAKE and other NLP tools). Those concepts form the nodes of a semantic network. The edges are drawn when two concepts occur jointly in the title or abstract of a paper. In that way, we generate an evolving semantic network that grows over time as more concepts are investigated together. The task is to predict, from unconnected nodes (i.e. concepts that have not been investigated together in the scientific literature), which will be connected within a few years. In this manuscript, we present 10 diverse statistical and machine learning methods to solve this challenge.
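The construction described in this caption can be sketched in a few lines. This is an illustrative simplification, not the benchmark code: the function names, the (year, concepts) input format and the toy papers are assumptions, and the degree cutoff c used in the benchmark is omitted. A pair of concepts that is unconnected up to a given year is labeled positive if it gains at least w edges within the following δ years.

```python
from itertools import combinations


def build_edges(papers):
    """Map each co-occurring concept pair to the list of years in which it co-occurs.

    `papers` is a hypothetical list of (year, concepts) tuples.
    """
    edges = {}
    for year, concepts in papers:
        for pair in combinations(sorted(set(concepts)), 2):
            edges.setdefault(pair, []).append(year)
    return edges


def nodes_until(papers, year):
    """Concepts that appear in at least one paper up to and including `year`."""
    return {c for y, concepts in papers if y <= year for c in concepts}


def label_unconnected_pairs(papers, year, delta, w=1):
    """Label every pair that is unconnected up to `year`.

    Label 1: the pair gains at least `w` edges in (year, year + delta]; else 0.
    """
    edges = build_edges(papers)
    labels = {}
    for pair in combinations(sorted(nodes_until(papers, year)), 2):
        years = edges.get(pair, [])
        if any(y <= year for y in years):
            continue  # already connected in the snapshot
        new_edges = sum(1 for y in years if year < y <= year + delta)
        labels[pair] = int(new_edges >= w)
    return labels


# Toy corpus (illustrative only).
toy_papers = [
    (2015, ["neural network", "graph embedding"]),
    (2016, ["graph embedding", "link prediction"]),
    (2018, ["neural network", "link prediction"]),
    (2019, ["neural network", "link prediction", "graph embedding"]),
]
toy_labels = label_unconnected_pairs(toy_papers, year=2016, delta=3, w=1)
```

Enumerating all unconnected pairs is quadratic in the number of nodes, so for a network with tens of thousands of concepts one would sample candidate pairs rather than list them exhaustively.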

Figure 3. Node degrees follow a heavy-tailed distribution due to the hubs. The nodes with the largest (466,319) and smallest (2) non-zero degrees correspond to neural network and video compression technique, respectively. The most common non-zero degree is 64. The 1,247 nodes with zero degree are not shown in this plot, and both axes are in log scale.

Figure 5. Centralization of concepts. The fraction of nodes (concepts) that corresponds to the fraction of edges (connections). Cumulative histogram of edges per node, up to a given year (1999, 2004, 2009, 2014). The graph was created by going over the edge list and adding to each year only the edges and nodes that are dated before that year (hence the 2014 plot contains all the nodes (concepts) in papers before 2014). The nodes are arranged by increasing degree. The plot is cumulative; hence the y value at x = 80 is the fraction of edges contributed by all the nodes at or below the 80th percentile of degrees.
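A minimal sketch of how such a cumulative curve can be computed from a list of node degrees is given below. The function name and the toy degrees are assumptions for illustration; the share of edges attributed to a set of nodes is taken here to be its share of the total degree, so every edge is counted once at each of its two endpoints.

```python
def cumulative_edge_share(degrees):
    """Return (node_fraction, edge_fraction) points with nodes sorted by increasing degree."""
    ordered = sorted(degrees)
    total = sum(ordered)
    points = []
    running = 0
    for rank, degree in enumerate(ordered, start=1):
        running += degree
        # Fraction of nodes seen so far vs. fraction of total degree they hold.
        share = running / total if total else 0.0
        points.append((rank / len(ordered), share))
    return points


# Toy degree list (illustrative only): one hub and three low-degree nodes.
toy_curve = cumulative_edge_share([4, 1, 2, 1])
```

In this toy example the three lowest-degree nodes (75% of the nodes) hold only half of the total degree, which is the kind of concentration on hubs that the figure visualizes.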

Figure 6. Receiver operating characteristic (ROC) curve for computing the area under the curve (AUC). Random predictions get the result right in half of the cases; therefore their ROC curve is a diagonal with AUC = 0.5 (orange). A model that has learned some properties of the dataset has an AUC > 0.5 (blue).
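The AUC can equivalently be read as the probability that a randomly chosen positive example (a pair that becomes connected) receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below computes it directly from this definition; the function name and the toy scores are assumptions for illustration, and the quadratic loop is only meant for small examples.

```python
def auc_from_scores(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    `labels` holds 1 for positives and 0 for negatives; ties count as 0.5.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    correct = 0.0
    for sp in positives:
        for sn in negatives:
            if sp > sn:
                correct += 1.0
            elif sp == sn:
                correct += 0.5
    return correct / (len(positives) * len(negatives))


# Toy predictions (illustrative only).
toy_auc = auc_from_scores([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0])
```

A model that assigns the same score to every pair obtains 0.5 under this definition, matching the diagonal ROC curve of random predictions.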

Figure 7. The Science4Cast benchmark: Link predictions in an exponentially growing semantic network. Here we show the AUC values for different models that use machine learning techniques (ML), hand-crafted network features (NF) or a combination thereof. The left plot shows results for the prediction of a single new link (i.e., w = 1); the right one shows results for the prediction of new triple links (i.e., w = 3). The task is to predict δ = [1, 3, 5] years into the future, with cutoff values c = [0, 5, 25]. We sort the models by the results for the task (w = 1, δ = 3, c = 0), which was the task in the Science4Cast competition. Data points that are not shown have an AUC below 0.6 or were not computed due to computational costs. Note that the prediction of new triple edges can be performed nearly deterministically. It will be interesting to understand the origin of this quasi-deterministic pattern in AI research.
Figure 4. The network became more connected over the years. Primary (left, blue) vertical axis: Number of connected components with more than one node. Secondary (right, orange) vertical axis: Number of nodes in the largest connected component. For example, the network in 2019 comprises one large connected component with 63,472 nodes and 1,247 isolated nodes, i.e. nodes with no edges. On the other hand, the 2001 network has 19 connected components with size greater than one, the largest of which has 2,733 nodes.