Structure-inducing pre-training

Language model pre-training and the derived general-purpose methods have reshaped machine learning research. However, there remains considerable uncertainty regarding why pre-training improves the performance of downstream tasks. This challenge is pronounced when using language model pre-training in domains outside of natural language. Here we investigate this problem by analysing how pre-training methods impose relational structure in induced per-sample latent spaces, that is, what constraints pre-training methods impose on the distance or geometry between the pre-trained embeddings of samples. A comprehensive review of pre-training methods reveals that this question remains open, despite theoretical analyses showing the importance of understanding this form of induced structure. Based on this review, we introduce a pre-training framework that enables a granular and comprehensive understanding of how relational structure can be induced. We present a theoretical analysis of the framework from first principles and establish a connection between the relational inductive bias of pre-training and fine-tuning performance. Empirical studies spanning three data modalities and ten fine-tuning tasks confirm theoretical analyses, inform the design of novel pre-training methods and establish consistent improvements over a compelling suite of methods.

Designing methods to induce explicit and deep structural constraints in latent space at the sample level is an open problem in natural language processing-derived methods relying on transfer learning. McDermott and colleagues propose and analyse a pre-training framework imposing such structural constraints, and empirically demonstrate its advantages by showing that it outperforms existing pre-training state-of-the-art methods.


Main
The pre-training (PT)/fine-tuning (FT) learning paradigm (also known as transfer learning) has had tremendous impact on natural language processing (NLP) and related domains [2,35,72]. In NLP or NLP-derived PT/FT, we are given a dataset $\boldsymbol{X} \in \mathcal{X}^{N_{PT}}$ and attempt to pre-train an encoder $f_\theta : \mathcal{X} \to \mathcal{Z}$ which maps our domain of interest $\mathcal{X}$ into a latent space $\mathcal{Z}$, with $f_\theta : x_i \mapsto z_i$. This encoder $f_\theta$ is then transferred for use in various fine-tuning tasks (which are not known at pre-training time). We evaluate PT/FT systems via the transfer performance of $f_\theta$ on said fine-tuning tasks.
In this work, we are concerned primarily with the efficacy of PT/FT for downstream tasks that operate at a per-sample level (e.g., in natural language processing, evaluating the sentiment of a whole restaurant review is a per-sample task, in contrast to identifying a named entity token within a sentence which is an intra-sample/per-token task). One aspect of pre-training that drives such eventual fine-tuning performance is the induced geometry of the pre-trained, per-sample latent space Z (formally defined in the Methods section). For example, it is well documented that the sentence embeddings produced by pre-trained language models in NLP can be non-smooth and anisotropic, which harms downstream task performance [73]. In other domains, such as biomedical modalities, where per-sample tasks are even more prevalent than intra-sample tasks as compared to NLP, the importance of this geometry only increases. Despite this importance, research into mechanisms to induce explicit, deep structural constraints in Z is surprisingly limited. Many methods outright ignore the geometry of Z (e.g., by imposing no pre-training loss over the whole-sample embeddings during pre-training) [2,4,5,5] and other methods impose either only shallow structural constraints, such as through an auxiliary, per-sample, classification PT objective [35,40,42], or deeper structural constraints, but in an implicit manner, such as through data-augmentation [56,60] or noising-based contrastive losses [57,59]. While such methods can be powerful and have been successful in many areas, we argue that the lack of a clear framework to design PT methods that impose structural constraints on Z that are simultaneously explicit (similar to supervised classification losses) and deep (similar to noising/augmentation-based contrastive losses) is a major weakness.
On the basis of this observation, we develop an analytical framework under which the PT objective is subdivided into two components: first, a language-model inspired imputation/denoising objective that leverages intra-sample relationships, and, second, a loss term explicitly driven to regularize the geometry of the per-sample latent space Z to reflect the connectivity patterns of a user-specified graph G PT . By relying on graphs to capture the structure we wish to induce in Z, this PT framework allows us to specify PT methods that induce deep structure in an explicit manner, filling exactly the gap identified above. In addition, this paradigm can capture diverse relationships, such as those motivated by external knowledge (e.g., [74]), self-supervised constraints (e.g., [75,76]), or distances between samples in an alternate modality (e.g., [69]). Moreover, this PT framework is simultaneously specific enough to allow us to make theoretical guarantees about how different PT graphs impact FT performance, general enough to encompass a variety of existing PT methods, and expressive enough to motivate new PT methods that have not been previously studied. In addition to theoretical analysis, we demonstrate empirically that defining new methods according to our framework, using explicit forms of real-world structure, yields significant benefits over competitive PT baselines across 3 modalities and 10 FT tasks.
Our work advances PT/FT research through three major contributions. First, we show via a comprehensive review and detailed commentary that existing pre-training methods largely do not induce structural constraints over Z that are simultaneously deep and explicit. Second, we establish a new framework for describing PT methods, which provides a vehicle to design new PT methods that explicitly induce deep structural constraints in Z in accordance with a user-specified PT graph G PT . We further support this framework with theoretical results quantifying how the graph's structure relates to FT task performance. Crucially, this formalization in our new PT paradigm offers insight into when PT does or does not add value over supervised learning alone. Third, we show that structure-inducing PT methods through our framework perform at or above the level of existing PT baselines across three data modalities and 10 FT tasks.

General Pre-Training Problem Formulation
Given a dataset $\boldsymbol{X}_{PT} \in \mathcal{X}^{N_{PT}}$, a PT method aims to learn an encoder $f_\theta : \mathcal{X} \to \mathcal{Z}$ such that $f_\theta$ can be transferred to FT tasks that are unknown at pre-training time. While we can leverage additional information at PT time to inform the training of $f_\theta$ (e.g., PT-specific labels $\boldsymbol{Y}_{PT}$), the encoder $f_\theta$ must take only samples from $\mathcal{X}$ as inputs so that it can be used for fine-tuning. Pre-training methods typically solve this problem by training $f_\theta$ to minimize a pre-training loss $\mathcal{L}_{PT}$ over $\boldsymbol{X}_{PT}$. For example, in BERT, $\mathcal{X}$ consists of free-text samples, $f_\theta$ is a transformer model, and $\mathcal{L}_{PT}$ consists of both a masked language modelling (MLM) per-token loss and the next-sentence prediction (NSP) per-sample loss [35].
Note that our definition of pre-training ignores secondary applications of the pre-training objective itself; for example, autoregressive language models (e.g., GPT-3 [2]) are often used for their generative use directly, and not as commonly used to acquire embeddings or in transfer learning. This is a perfectly valid use of pre-trained language models within NLP, but is often not as useful in other domains which lack NLP's generative properties, so we focus on the induced embeddings produced by pre-training methods instead. Note further that we are primarily interested in PT methods that either are or are derived from NLP PT methods. This domain is of particular interest because these methods (1) have been extremely successful within NLP [2,35,77], (2) have motivated a large number of derived methods in non-language, biomedical modalities [19,33,43,46], and (3) are not yet fully technically understood [29,73,78].

Defining Explicit and Deep Structural Constraints
Central to our hypothesis is the claim that most NLP-derived PT methods today do not impose explicit, deep constraints on the (per-sample) latent space geometry of Z. To justify this claim, we define "explicit" and "deep" structural constraints (Definitions 1-2).

Definition 1. Explicit vs. Implicit Structural Constraints:
A PT objective $\mathcal{L}_{PT}$ imposes a structural constraint that is explicit (vs. implicit) to the degree that it (as $f_\theta$ approaches optimality) permits us to reason directly about the relationship (in particular, the distance) between any two samples $z_i$ and $z_j$ in the latent space $\mathcal{Z}$.

Definition 2. Deep vs. Shallow Structural Constraints:
A PT objective $\mathcal{L}_{PT}$ imposes a structural constraint that is deep (vs. shallow) on the basis of how much information (e.g., how many dimensions) would be required to fully satisfy the constraint.
For example, consider a classification PT loss according to labels $y_i \in \mathcal{Y}$ and a logit layer which maps $z_i \mapsto \tilde{y}_i$. This method produces an explicit structural constraint because, near optimality, we can infer that the relative (cosine) distance between two samples $z_i$ and $z_j$ is small if and only if $y_i = y_j$. However, this constraint is also shallow: to fully satisfy it, we need only embed each class $c \in \mathcal{Y}$ at a unique position $p_c \in \mathcal{Z}$, then compress all samples $z_i$ near their class prototype $p_{y_i}$. This distance-based constraint can be satisfied in a very low-dimensional space $\mathcal{Z}$ (e.g., we can distribute the prototypes $p_c$ uniformly about a 2D unit circle, then compress all $z_i$ to lie at a very small cosine distance from their class prototypes), illustrating that this constraint is very shallow.
In contrast, consider a contrastive method that asserts that $z_i = f_\theta(x_i)$ should be close to $\tilde{z}_i = f_\theta(\tilde{x}_i)$, under some noising/augmentation procedure $x_i \mapsto \tilde{x}_i$, but simultaneously far from other samples $z_j$. While this method constrains the latent space to be smooth with respect to the noising process, it offers only an implicit constraint on $\mathcal{Z}$, as it is generally not possible to infer how the distance between distinct samples $z_i$ and $z_j$ is constrained. However, it imposes a deeper constraint than does the classification objective because the implicit connections between samples induced by the noising procedure reflect relationships that cannot necessarily be captured in a low-dimensional space (dependent on dataset size and density).
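The shallowness of the classification constraint can be checked numerically. The following is a minimal sketch, not taken from the original work: the label list, the jitter size and all function names are illustrative assumptions. It places one prototype per class on a 2D unit circle, embeds each sample next to its prototype, and compares same-class and cross-class cosine distances.

```python
import math
import random


def class_prototypes(num_classes):
    """Place one prototype per class evenly around the 2D unit circle."""
    return [
        (math.cos(2 * math.pi * c / num_classes),
         math.sin(2 * math.pi * c / num_classes))
        for c in range(num_classes)
    ]


def cosine_distance(u, v):
    """Cosine distance, 1 - cos(angle), between two 2D vectors."""
    dot = u[0] * v[0] + u[1] * v[1]
    return 1.0 - dot / (math.hypot(*u) * math.hypot(*v))


def embed(labels, prototypes, jitter=1e-3, seed=0):
    """Embed each sample as its class prototype plus a small perturbation."""
    rng = random.Random(seed)
    return [
        (prototypes[y][0] + rng.uniform(-jitter, jitter),
         prototypes[y][1] + rng.uniform(-jitter, jitter))
        for y in labels
    ]


labels = [0, 1, 2, 3, 0, 1, 2, 3, 0, 1]  # hypothetical PT class labels
z = embed(labels, class_prototypes(num_classes=4))

pairs = [(i, j) for i in range(len(z)) for j in range(i)]
max_same = max(cosine_distance(z[i], z[j])
               for i, j in pairs if labels[i] == labels[j])
min_diff = min(cosine_distance(z[i], z[j])
               for i, j in pairs if labels[i] != labels[j])
print(f"largest same-class distance:   {max_same:.2e}")
print(f"smallest cross-class distance: {min_diff:.2f}")
```

Two dimensions are enough to separate the classes here, which is the sense in which the constraint carries little information about the relationship between samples beyond their class membership.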

Existing Pre-training Methods do not use Deep, Explicit Constraints
To show that existing methods largely do not provide means to impose structural constraints that are simultaneously deep and explicit, we survey over 90 existing PT methods on the basis of how their objective functions constrain $\mathcal{Z}$ (Figure 1, Appendix A). For full details on our review findings, see the Methods section. Throughout all examined methods, we find that deep, explicit structural constraints are almost never employed. Instead, most methods either (1) impose no per-sample PT objectives at all (e.g., text-generation models, which are often not used for embeddings at all but rather for prompting or generative applications [2,4-6]), (2) use explicit, but shallow, supervised PT objectives (e.g., BERT's "Next-sentence Prediction" (NSP) objective, ALBERT's "Sentence-order Prediction" (SOP) objective, or various multi-task objectives [35,40,42]), or (3) use implicit, but deep, un- or self-supervised contrastive PT objectives (e.g., contrastive sentence embedding losses [56,57,59,60,79]).
Across all surveyed methods, we find that only four methods impose simultaneously explicit and deep constraints: KEPLER [68], CK-GNN [69], XLM-K [70], and WebFormer [71]. All four can be described as some form of per-sample graph alignment, in which an external, pre-training knowledge graph G PT or connectivity algorithm is employed over a subset of pre-training samples, and the output embeddings of pairs of samples z i = f θ (x i ) and z j = f θ (x j ) are constrained to reflect their relationships in the pre-training graph. This form of constraint is explicit, as the graph G PT contains explicit relationships that will be induced in the output latent space, but also deep, as the geometry of the graph G PT can be arbitrarily complex.
However, all these methods have major limitations. In KEPLER and XLM-K, the per-sample embeddings are only constrained for a restricted set of samples corresponding to entity descriptions from a knowledge graph. As such, no constraints are imposed on the general-domain free-text samples in $\mathcal{X}$ [68,70]. In CK-GNN, the graph connectivity is derived from a cluster-restricted 1-nearest-neighbor graph in an alternate modality's distance space, which may offer only limited higher-order structure, and unlike the NLP approaches, this method has no intra-sample (e.g., per-token) pre-training task [69]. Finally, in WebFormer, the graph used is inferred from the structure of the HyperText Markup Language (HTML) underlying web pages, and relationships are only constrained at the per-sample level for limited structural relationships within the HTML. Further, WebFormer is a specialized model for processing web content (text and HTML elements), so its approach cannot be directly generalized to other domains [71]. Moreover, these methods explore only the particular contexts of their individual models. They offer no general framework for how to realize this style of deep, explicit per-sample constraints in other contexts, nor do they explore any theory on how these constraints relate to performance for fine-tuning tasks [68-71].
Overall, our review of pre-training methods establishes unequivocally that pre-training methods capable of providing explicit, deep structural constraints are significantly under-explored. Across all the methods we reviewed, only four leverage constraints that are both explicit and deep, all four have significant limitations, and there is no general consensus on how to constrain $\mathcal{Z}$ explicitly and deeply. These findings motivate our new framework, which offers insight into how to realize deep, explicit structural constraints in pre-training models across diverse contexts and provides theoretical guidance on how structural constraints relate to fine-tuning performance.

New Pre-training Framework: Structure-Inducing Pre-training (SIPT)
Our pre-training problem framework includes two small, but important, differences from the standard formulation (Figure 2).
First, we assume that we have as an additional input to the PT problem a graph $G_{PT} = (V, E)$, where vertices denote pre-training samples within $\boldsymbol{X}_{PT}$ (e.g., $\{x_{PT} \mid x_{PT} \in \boldsymbol{X}_{PT}\} \subseteq V$) and edges represent user-specified relationships. Importantly, while we take the graph $G_{PT}$ as an input to the PT problem, we cannot use it as a direct input to $f_\theta$. Just as in traditional pre-training, $f_\theta$ must take as input only samples from $\mathcal{X}$; otherwise, we could not apply $f_\theta$ to the same, general class of FT tasks over the domain $\mathcal{X}$.
Second, we decompose the PT loss $\mathcal{L}_{PT}$ into two components, weighted with hyperparameter $0 \le \lambda_{SI} \le 1$:

$$\mathcal{L}_{PT} = (1 - \lambda_{SI})\,\mathcal{L}_M + \lambda_{SI}\,\mathcal{L}_{SI}.$$

Here $\mathcal{L}_M$ is a traditional, intra-sample objective (e.g., a language model), and $\mathcal{L}_{SI}$ is a new, structure-inducing objective designed to regularize the per-sample latent space geometry in accordance with the relationships (edges) in $G_{PT}$. Under our framework, $\mathcal{L}_{SI}$ is only allowable for $G_{PT}$, $f_\theta$, and $\mathcal{Z}$ if it admits a stable optimum at which a radius nearest-neighbor connectivity algorithm, under some distance function in $\mathcal{Z}$, recovers $G_{PT}$ (the formal constraint is in the Methods section). Note that this constraint connects our framework to the wealth of existing research on graph representation learning [80-85]. These techniques do offer valuable insights into how to sample minibatches over graph-structured data and devise losses for graph embeddings. However, many methods for actually modelling graph-structured data, including deep attributed graph embeddings and graph convolutional neural networks, should not be seen as replacements for our techniques: they are typically not adaptable to contexts in which the graph is not known at inference time, and so they could not be used in our pre-training setting, where $f_\theta$ must take in only inputs from $\mathcal{X}$ directly.
As the new loss term $\mathcal{L}_{SI}$ is explicitly designed to induce the structure of $G_{PT}$ in $\mathcal{Z}$, we call methods trained under our framework structure-inducing pre-training (SIPT) methods. Many existing PT approaches can be re-realized as SIPT methods, including classification-based PT objectives like NSP or SOP, contrastive methods, and existing graph alignment methods (see Methods for full details).
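The two ingredients of this decomposition can be sketched in a few lines. The sketch below is illustrative only and rests on stated assumptions: the two loss terms are mixed as a convex combination controlled by the weight (consistent with the weight lying in [0, 1], but not confirmed by this section), the distance is Euclidean, and `combined_pt_loss`, `radius_graph` and `recovers_graph` are hypothetical helper names.

```python
import math


def combined_pt_loss(loss_m, loss_si, lambda_si):
    """Mix the intra-sample loss and the structure-inducing loss.

    Assumption: a convex combination weighted by lambda_si in [0, 1].
    """
    if not 0.0 <= lambda_si <= 1.0:
        raise ValueError("lambda_si must lie in [0, 1]")
    return (1.0 - lambda_si) * loss_m + lambda_si * loss_si


def radius_graph(embeddings, radius):
    """Undirected edges (i, j), i < j, whose Euclidean distance is below radius."""
    edges = set()
    for i in range(len(embeddings)):
        for j in range(i + 1, len(embeddings)):
            if math.dist(embeddings[i], embeddings[j]) < radius:
                edges.add((i, j))
    return edges


def recovers_graph(embeddings, target_edges, radius):
    """True if radius nearest-neighbour connectivity reproduces the target graph."""
    target = {tuple(sorted(edge)) for edge in target_edges}
    return radius_graph(embeddings, radius) == target


# Hypothetical PT graph: a path 0 - 1 - 2 - 3.
path_edges = [(0, 1), (1, 2), (2, 3)]
good_embedding = [(0.0,), (1.0,), (2.0,), (3.0,)]
bad_embedding = [(0.0,), (1.0,), (1.2,), (3.0,)]

print(combined_pt_loss(loss_m=2.0, loss_si=4.0, lambda_si=0.25))
print(recovers_graph(good_embedding, path_edges, radius=1.5))
print(recovers_graph(bad_embedding, path_edges, radius=1.5))
```

The second embedding fails the check because nodes 0 and 2 fall within the radius of each other and node 3 drifts away from node 2, so the recovered edge set no longer matches the path.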

Theoretical Analyses
Under our framework, one can link the structure of the PT graph $G_{PT}$ to eventual FT task performance. In particular, as a SIPT embedder $f$ over graph $G_{PT}$ approaches optimality under the loss $\mathcal{L}_{SI}$, it produces an embedding space such that nearest-neighbor performance for any downstream task is lower bounded by the performance that could be obtained via a nearest-neighbor algorithm over graph $G_{PT}$ (Theorem 1). This fact directly connects the geometry of the graph $G_{PT}$ with the eventual fine-tuning performance of a SIPT embedder $f$. Furthermore, it demonstrates the advantage of employing an explicit constraint rather than an implicit one: by controlling the structure of $G_{PT}$, users can directly choose to add different inductive biases to the PT process, in a manner which has a provable impact on the eventual suitability for downstream FT tasks.

Theorem 1. Let $\boldsymbol{X}_{PT}$ be a PT dataset, $G_{PT}$ be a PT graph, and let $f_{\theta^*}$ be an encoder pre-trained under a PT objective permissible under our framing that realizes an $\mathcal{L}_{SI}$ value no more than a threshold $\ell^*$. Then, under embedder $f_{\theta^*}$, the nearest-neighbor accuracy for an FT task $y$ converges, as dataset size increases, to at least the local consistency (Definition 5) of $y$ over $G_{PT}$.
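To make the quantity in Theorem 1 concrete, the sketch below computes a local consistency score on a small graph. Definition 5 is not reproduced in this section, so the code assumes one natural reading: the fraction of nodes whose label agrees with the majority label among their graph neighbours. The graph, the labels and the function names are all hypothetical.

```python
from collections import Counter


def adjacency_from_edges(num_nodes, edges):
    """Build an undirected adjacency list from an edge list."""
    adjacency = {i: [] for i in range(num_nodes)}
    for i, j in edges:
        adjacency[i].append(j)
        adjacency[j].append(i)
    return adjacency


def local_consistency(adjacency, labels):
    """Fraction of nodes whose label matches the majority label of their neighbours.

    Assumed reading of local consistency; ties go to the label seen first.
    """
    agree, counted = 0, 0
    for node, neighbours in adjacency.items():
        if not neighbours:
            continue  # isolated nodes cast no neighbourhood vote
        majority = Counter(labels[n] for n in neighbours).most_common(1)[0][0]
        counted += 1
        agree += int(labels[node] == majority)
    return agree / counted if counted else 0.0


def clique(nodes):
    """All pairwise edges among the given nodes."""
    return [(a, b) for k, a in enumerate(nodes) for b in nodes[k + 1:]]


# Hypothetical PT graph: two disjoint 4-cliques.
graph = adjacency_from_edges(8, clique([0, 1, 2, 3]) + clique([4, 5, 6, 7]))
clean_labels = ["a"] * 4 + ["b"] * 4            # FT labels aligned with the graph
noisy_labels = ["b"] + ["a"] * 3 + ["b"] * 4    # node 0 disagrees with its clique

consistency_clean = local_consistency(graph, clean_labels)
consistency_noisy = local_consistency(graph, noisy_labels)
print(consistency_clean, consistency_noisy)
```

When the FT labels follow the graph structure exactly, every node agrees with its neighbourhood; a single mislabelled node in one clique lowers the score by one eighth in this example.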
We also establish two important corollaries of Theorem 1 that further illustrate the importance of choosing graphs G PT which impose deep structural constraints (Corollaries 1-2).
Corollary 1. Let $G_{PT}$ be a PT graph derived from a supervised pre-training label $y$, in which two samples are linked if and only if they share the same pre-training label, so that each pre-training class forms a clique. Then, the local consistency for a given FT task $y^{(FT)}$ over $G_{PT}$ (and thus, by Theorem 1, the nearest-neighbor accuracy for any optimized SIPT embedder) is upper bounded by the probability that a sample $x_i$'s fine-tuning label $y^{(FT)}_i$ agrees with the majority class label for task $y^{(FT)}$ over the clique consisting of all nodes with the same pre-training label $y_i$ as $x_i$.

Corollary 2. Let $\boldsymbol{X}_{PT}$ be a PT dataset that can be realized over a valid manifold $\mathcal{M}$. Assume $\boldsymbol{X}_{PT}$ is sampled with full support over $\mathcal{M}$. Let $G_{PT} = (\boldsymbol{X}_{PT}, E)$ be an $r$-nearest-neighbor graph over $\mathcal{M}$ (i.e., $(x_i, x_j) \in E$ if and only if the geodesic distance between the two points on $\mathcal{M}$ is less than $r$: $D_{\mathcal{M}}(x_i, x_j) < r$). Let $y^{(FT)}$ be an FT classification task that is almost everywhere smooth on the manifold. Then, as PT dataset size (and thus the size of $G_{PT}$) tends to $\infty$, and $r$ tends to zero, the local consistency of $y^{(FT)}$ over $G_{PT}$ (and thus, by Theorem 1, the nearest-neighbor accuracy of a SIPT embedder) will likewise tend to 1.
Informally, these corollaries establish that when a shallow structural constraint is used (e.g. a supervised classification objective), then the associated SIPT-equivalent model permits only minimal guarantees for FT performance, driven by the extent to which an FT task label is consistent within the classes under the supervised PT objective. In contrast, if a deep structural constraint is used, realized in Corollary 2 via G PT being a nearest-neighbor graph over an arbitrary manifold M, then a SIPT model permits a theoretical guarantee for FT performance that approaches unity as the pre-training dataset size grows for any FT task that is smooth over M.
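The quantity that limits a classification-style PT graph can be computed directly. The sketch below is an illustration under stated assumptions rather than the authors' procedure: it takes hypothetical PT and FT labels, treats each PT class as a clique, and measures how often a sample's FT label matches the majority FT label of its clique (the sample itself is included in the vote, an assumption made for simplicity).

```python
from collections import Counter, defaultdict


def clique_majority_agreement(pt_labels, ft_labels):
    """Fraction of samples whose FT label equals the majority FT label
    among all samples sharing their PT label."""
    ft_by_pt_class = defaultdict(list)
    for y_pt, y_ft in zip(pt_labels, ft_labels):
        ft_by_pt_class[y_pt].append(y_ft)
    # Within each PT class, the samples that agree with the majority are
    # exactly those carrying the most common FT label.
    agreeing = sum(Counter(members).most_common(1)[0][1]
                   for members in ft_by_pt_class.values())
    return agreeing / len(pt_labels)


pt_labels = [0, 0, 0, 0, 1, 1, 1, 1]                     # hypothetical PT classes
aligned_ft = ["x", "x", "x", "x", "y", "y", "y", "y"]    # FT task follows PT classes
mixed_ft = ["x", "x", "x", "y", "y", "y", "y", "x"]      # FT task cuts across them

agreement_aligned = clique_majority_agreement(pt_labels, aligned_ft)
agreement_mixed = clique_majority_agreement(pt_labels, mixed_ft)
print(agreement_aligned, agreement_mixed)
```

When the FT task cuts across the PT classes, the agreement drops below 1 no matter how well the encoder is optimized, which is why a shallow PT graph offers only a weak guarantee.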
In sum, this theoretical analysis shows that we can directly connect the structure induced in Z to downstream FT performance. As such, moving to new PT methods which leverage graphs G PT with deeper structural constraints has the potential to markedly improve performance, as we will demonstrate on real-world datasets in our experiments. Complete proofs for all theoretical results and semi-synthetic experiments validating our theoretical findings in practice are in the Methods section.

Real-world Experiments: Datasets and Tasks
We examine three data modalities for our experiments: PROTEINS, containing protein sequences; ABSTRACTS, containing free-text biomedical abstracts; and NETWORKS, containing sub-graphs of protein-protein interaction (PPI) networks.
In each data modality, we use different pre-training datasets and leverage different kinds of pre-training graphs G PT , test on publicly available benchmarks for FT tasks, and compare our SIPT methods to compelling baselines spanning both per-sample and/or per-token methods (Tables 1-3). Further details on these aspects can also be found in the Methods Section.

Real-world Experiments: L SI and Training Procedures
As discussed in the definition of our framework, a SIPT method differs from a standard PT method by (1) the choice of graph $G_{PT}$ (Table 1) and (2) the design of the new, structure-inducing loss $\mathcal{L}_{SI}$. To define $\mathcal{L}_{SI}$ in our experiments, we leverage ideas from structure-preserving metric learning (SPML) [86-88]. SPML is a form of metric learning where positive relationships are defined by edges in a graph rather than by a shared supervised label. We adapt two losses, a traditional contrastive loss [89] and a multi-similarity loss [90], from supervised metric learning to the graph-based, structure-preserving context of $\mathcal{L}_{SI}$ terms in SIPT.
In addition to these losses, in the ABSTRACTS and PROTEINS domains, we use a warm-start procedure to initialize pre-training from existing language models rather than beginning from scratch. This saves significant computational time and allows for a powerful ablation study that isolates the performance improvements due to the introduction of our $\mathcal{L}_{SI}$ term. We also perform extensive hyperparameter tuning studies on these two domains to identify appropriate values for $\lambda_{SI}$, and adapt those findings to the NETWORKS domain. Further details about the experimental setup, including formal statements of our contrastive and multi-similarity losses, are in the Methods section.
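The idea of replacing shared labels with graph edges can be sketched as follows. This is a minimal illustration, assuming a standard pairwise margin-based contrastive loss with Euclidean distance; it is not the formal loss stated in the Methods section, and `graph_contrastive_loss`, the margin value and the toy embeddings are hypothetical.

```python
import math


def graph_contrastive_loss(embeddings, edges, margin=1.0):
    """Pairwise contrastive loss where positives are graph edges.

    Linked pairs are pulled together (squared distance); unlinked pairs are
    pushed apart until they are at least `margin` away (squared hinge).
    """
    linked = {frozenset(edge) for edge in edges}
    total, num_pairs = 0.0, 0
    for i in range(len(embeddings)):
        for j in range(i + 1, len(embeddings)):
            distance = math.dist(embeddings[i], embeddings[j])
            if frozenset((i, j)) in linked:
                total += distance ** 2
            else:
                total += max(0.0, margin - distance) ** 2
            num_pairs += 1
    return total / num_pairs if num_pairs else 0.0


edges = [(0, 1), (2, 3)]  # hypothetical PT graph with two linked pairs
respects_graph = [(0.0, 0.0), (0.1, 0.0), (2.0, 0.0), (2.1, 0.0)]
violates_graph = [(0.0, 0.0), (2.0, 0.0), (0.1, 0.0), (2.1, 0.0)]

loss_good = graph_contrastive_loss(respects_graph, edges)
loss_bad = graph_contrastive_loss(violates_graph, edges)
print(f"{loss_good:.4f} {loss_bad:.4f}")
```

The loss is small when linked samples sit close together and unlinked samples are separated by more than the margin, and it grows when the embedding places linked samples far apart.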
Result 1: Incorporating $\mathcal{L}_{SI}$ performs comparably to or improves over all baselines across all 3 domains and 10 FT tasks
To analyze our experiments, we compute the relative reduction of error of the best performing SIPT model vs. the per-token or per-sample baselines across all FT tasks (Table 2). In 10/15 cases, SIPT improves over existing methods, and in no case does it do worse than either baseline. In some cases, the gains in performance are quite significant, with improvements of approximately 17% (0.05 macro-F1 raw change) on AA, 6% on SRE (0.01 macro-F1 raw change), and 4% on RH (2% accuracy raw change). SIPT models further establish a new SOTA on AA and RH and match SOTA on FL, ST, and PF.
Figure 3 shows how performance evolves over FT iterations for the NETWORKS dataset, which lets us check whether the improvements observed at the final converged values are present throughout training. SIPT methods converge faster, and to better performance, than both baselines. Raw results across all settings are presented in the Methods section (Tables 7-8).

Result 2: These performance gains are present across diverse modalities and pre-training graphs and outperform both per-sample and per-token baselines
SIPT performance gains persist over all three data modalities and all the different $G_{PT}$ types we use here. This shows that explicitly regularizing the per-sample latent space geometry offers value across NLP, non-language sequences, and non-sequential domains, and when leveraging graphs defined by external knowledge, by self-supervised signals in the data directly, and by nearest-neighbor methods over multi-task label spaces. Furthermore, these improvements exist not only in comparison to standard language modelling approaches but also against existing methods that impose per-sample PT objectives, including single and multi-task classification objectives.

Result 3: Observed gains are uniquely attributable to the novel loss $\mathcal{L}_{SI}$
As outlined in the Methods section, our experimental design permits us to determine how much of the observed gains in Table 2 are due to the novel loss component, as opposed to, for example, continued training, new PT data, or the batch selection procedures used in our method, which also indirectly leverage the knowledge inherent in $G_{PT}$. Unsurprisingly, some gains are due to these other factors, and performance gains shrink when considering these ablation studies. However, even when comparing against the maximal performance baseline or ablation study overall, neither the direction of observed relationships nor the statistical significance of observed comparisons changes. Therefore, we can conclusively state that the performance improvements observed here are uniquely attributable to the new, structure-inducing components introduced by our framework. Full ablation study results can be found in the Methods section (Tables 7-8).
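For readers who want to relate a raw score change to a relative reduction of error, the sketch below shows one common way to compute it. The exact formula used for Table 2 is not recoverable from this section, so the definition here is an assumption, and the scores in the example are hypothetical values chosen for illustration, not results from the paper.

```python
def relative_error_reduction(baseline_score, new_score, best_score=1.0):
    """Relative reduction of error between a baseline and a new model.

    Assumed definition: error = best_score - score, and the reduction is
    (baseline_error - new_error) / baseline_error.
    """
    baseline_error = best_score - baseline_score
    new_error = best_score - new_score
    if baseline_error <= 0:
        raise ValueError("baseline already attains the best possible score")
    return (baseline_error - new_error) / baseline_error


# Hypothetical scores: a metric rises from 0.80 to 0.85 (raw change 0.05).
rre = relative_error_reduction(baseline_score=0.80, new_score=0.85)
print(f"relative reduction of error: {rre:.1%}")
```

Under this definition, the same raw improvement counts for more when the baseline is already close to the best attainable score, because less error remains to be removed.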

Discussion
We show that despite the breadth of research into PT methods, methods for imposing explicit and deep structural constraints over the per-sample, pre-training latent space Z are under-explored ( Figure 1). Our theoretical and empirical analyses show that this deficit matters in practice. In particular, we define a new pre-training framework, structure-inducing pre-training (SIPT), under which the PT loss is subdivided into two components: one which is designed to capture intra-sample (e.g. per-token) relationships and one which is designed to constrain the per-sample latent space to capture relationships between samples given by a user-specified pre-training graph G PT . Under our framework, we show both theoretically and via experiments that the structure induced in Z can be directly connected to eventual fine-tuning performance. Empirically, we show that novel SIPT methods leveraging a variety of pre-training graphs can consistently outperform compelling existing PT methods across three real-world domains.
Our work highlights several important directions for future research. For example, are there losses better suited than metric learning losses for pre-training graphs? Can we, for instance, leverage the graph distance alongside the intra-batch distance to improve negative sampling strategies? In addition, can we produce theoretical results on the convergence of pre-trained models, and advance the understanding of when and how pre-trained models converge to solutions that recover $G_{PT}$? In a different direction, can pre-trained models reflect forms of structure beyond nearest-neighbor relationships, for example by leveraging higher-order topological considerations or by matching a distance function rather than a discrete graph? We anticipate that further analyses of these and other questions will lead to new pre-training methods and enable pre-training to be successful across diverse domains.
Data availability. Our synthetic datasets and pointers to all real-world datasets used (which are all publicly available) are available here: https://github.com/mmcdermott/structure_inducing_pre-training.
Code availability. All code for this project is available at https://github.com/mmcdermott/structure_inducing_pre-training.

Figure 1: Clusters are sized such that the area corresponds to the number of citations methods included in that cluster have received on average per month since first publication, according to Google Scholar's citation count. "None" captures models that leverage no pre-training loss over the per-sample embedding. "NSP" refers to "Next-sentence Prediction," the per-sample PT task introduced in BERT [35]. "SOP" refers to "Sentence-order Prediction," the per-sample PT task introduced in ALBERT [40]. Note that over 90 studies in total were considered in our review, but only 71 met the inclusion criteria to be included in this figure. These methods are described in more detail in Methods.

| | PROTEINS | ABSTRACTS | NETWORKS |
| $(x_i, x_j) \in E$ if and only if | $x_i$ interacts with $x_j$ | $x_i$'s paper cites $x_j$'s paper | $x_i$'s central protein agrees on all but 9 Gene Ontology (GO) labels with $x_j$'s central protein |
| Per-token baseline | TAPE [15] | SciBERT [91] | Attribute Masking [43] |
| Per-sample baseline | PLUS [45] | None | Multi-task learning [43] |
| FT dataset | TAPE [15] | SciBERT [91] | [43] |

Table 1: A summary of our datasets, tasks, and benchmarks. For example, for the PROTEINS domain, our pre-training dataset is the set of protein sequences contained in the tree-of-life dataset [74], proteins are linked in our pre-training graph $G_{PT}$ if and only if they interact according to the tree-of-life graph, and we compare over the fine-tuning tasks in the TAPE benchmark against both the raw, per-token baseline publicly available in the TAPE model [15] as well as the per-sample baseline published in the PLUS pre-training model [45].

Table 2: Relative reduction of error of models trained under our framework vs. published per-token or per-sample baselines (columns: Domain, Task, vs. Per-Token PT, vs. Per-Sample). Higher numbers indicate models under our framework reduce error more and thus outperform baselines. The ∆ column indicates whether the model offers a statistically significant improvement (↑), no significant change (∼), or a statistically significant decrease (↓). Statistical significance is assessed via a t-test at significance level p < 0.1. Per-sample analysis and variance estimates for CP were infeasible due to the computational cost of this task.

| FT Dataset | FT Task Name | Abbr. | Description | Metric |
| TAPE [15] | Remote Homology | RH | Per-sequence classification task to predict protein fold category. | Accuracy |
| | Secondary Structure | SS | Per-token classification task to predict amino acid structural properties. | |
| … | | | | |
| | | | Multi-label binary classification into 40 Gene Ontology terms | Macro-AUROC |

Table 3: Fine-tuning tasks.

Online Methods
Per-token vs. Per-sample Latent Space: Definition of Z
Let $f_\theta$ be a pre-training (PT) model trained over a dataset $\boldsymbol{X} \in \mathcal{X}^{N_{PT}}$. Furthermore, let us assume that samples $x \in \mathcal{X}$ are composed of smaller units (e.g., tokens, sequence time-points, nodes in a network, etc.). Let us denote this by saying that $x = w_1, w_2, \ldots, w_{n_x}$. Finally, as is true in natural language processing (NLP) and NLP-derived settings, we assume that $f_\theta$ can be seen to produce output embeddings both for the entire sample $x$, which we will denote by $f_\theta(x)$, and for the internal tokens individually, which we will denote by $f_\theta(w_j \mid x)$. For example, in the BERT model [35], $f_\theta(x)$ will be given by the output embedding of the [CLS] token of $x$ and $f_\theta(w_j \mid x)$ will be given by the output embedding of the $j$-th token in $x$.
We can then formally define the per-sample latent space, $\mathcal{Z}^{(S)}$ (which we will also refer to as $\mathcal{Z}$ without the superscript), and the per-token (aka intra-sample) latent space $\mathcal{Z}^{(T)}$ (Definitions 3 & 4, and Figure 4).

Definition 3. Per-Sample Latent Space
We define the per-sample latent space induced by $f_\theta$ as $\mathcal{Z}^{(S)} = \{ f_\theta(x) \mid x \in \mathcal{X} \}$. We will also use $\mathcal{Z}$ with no superscript to refer to this space.

Definition 4. Per-token/Intra-sample Latent Space
We define the per-token latent space (also known as the intra-sample latent space) induced by $f_\theta$ as $\mathcal{Z}^{(T)} = \{ f_\theta(w_j \mid x) \mid x \in \mathcal{X},\ w_j \in x \}$.

These two spaces are very different and are useful in different contexts. For a task like named entity recognition, where the unit of classification is a single token or a short span of tokens, analyzing the per-token latent space will be more informative, whereas for a task like sentiment analysis, where the unit of classification is an entire sample (sentence), the per-sample latent space would be preferred [35]. Another key difference between these spaces is that the traditional PT language model objective only induces significant constraints on the geometry of the per-token latent space and does not impact the per-sample latent space at all. This illustrates a gap in the capabilities of PT methods. In our work, we are concerned with precisely this gap and focus our attention on $\mathcal{Z}$ (i.e., $\mathcal{Z}^{(S)}$). We focus our attention on the per-sample latent space for 3 reasons:
1. There has been significantly more research on how to regularize the per-token latent space than the per-sample latent space, as we show in our extensive review (Table 4).
2. In many domains outside of NLP, the per-sample latent space is often of much greater interest than the intra-sample latent space. For example, in modelling protein sequences [15], drug structures [43], or electronic health record time series [46], per-sample tasks are of much greater interest than intra-sample tasks.
3. Even within NLP, modern methods struggle much more with representing whole passages of text rather than short, isolated spans. This is evidenced by the battery of work examining sentence representations atop pre-trained language models [73,92].

Why is NLP Different than Other Domains?
In this work, we have implicitly argued that because a PT objective like masked language modelling (MLM) will not necessarily directly enrich the per-sample latent space Z (S) , it may yield models less well suited to downstream per-sample tasks than other approaches. One seeming contradiction to this is that methods in NLP like RoBERTa [4] (for which MLM is the only PT objective) succeed across both per-token and per-sample tasks.
In fact, this observation does not contradict our hypothesis but reflects a unique advantage of the natural language modality that does not apply in other domains. In particular, in the NLP domain (and not in other domains), we can leverage the flexibility of language to sidestep any deficit in Z (S) by re-framing per-sample tasks as per-token, language modelling tasks. Significant literature exists documenting this phenomenon through the lenses of prompting, cloze-filling models, text-to-text transformers, and theoretical analyses [2,3,11,77,93]. For example, [93] examines the efficacy of pre-trained language models on sentiment analysis explicitly and shows that the language modelling component alone can be used in a per-token manner to indirectly solve a review sentiment analysis task by judging the likelihood of following the review with a ":)" emoji vs. a ":(" emoji. In this way, the per-sample task of sentiment analysis is shifted to a per-token task via the (inserted) emoji. However, language model pre-training has also inspired many derived methods in non-NLP domains. For example, in modelling graphs, [43] has examined vertex- or edge-masking strategies reminiscent of MLM, with vertices and edges analogous to tokens and entire graphs to whole samples; in modelling time series data, [46] has examined masked imputation models, with timepoints analogous to tokens and whole time series to samples; and in modelling protein sequences, [45] has used masked language modelling directly, with individual amino acids representing tokens and entire proteins representing samples. In all three of these domains, we cannot re-frame per-sample tasks as "per-token" tasks as we can in NLP, and accordingly, the problem of insufficient per-sample latent space regularization will likely be much more severe in these domains.
Accordingly, existing work, including the three works referenced above, all find that augmenting the language model pre-training task with additional, per-sample level supervised tasks can be beneficial, or even absolutely essential, to improving performance [43,45,46,94].

Pre-training Review Methodology
Papers were selected via a manual search of the natural language processing (NLP) and NLP-derived pre-training methods (i.e., methods focused primarily on other domains or on multi-modal domains were excluded) via Google Scholar as well as by crawling through references of papers already included. Citation counts for each work were obtained via Google Scholar on August 2nd, 2022. Publication date (used to calculate citations per month since publication date) was computed as the earlier of either (1) the paper's venue-specific date of publication or (2) the first submission date to the arXiv or BioRxiv platform, as referenced via an exact title match. A manual review was done to classify how pre-training methods constrain latent space geometry and assign subjective, numerical "shallow-deep" and "explicit-implicit" axes scores. In total, over 90 methods were examined, of which 71 were suitable for inclusion in numerical review results (Figure 1 and Table 4). All methods considered are summarized and categorized (and reasons for exclusions are given) in Appendix A.

Further Analysis of Reviewed Methods
This work has extensively examined how existing pre-training methods constrain the per-sample latent space. However, it is also worth examining how these methods constrain the per-token latent space to demonstrate the extent to which per-sample objectives are under-explored in current pretraining research. To that end, we break down all of the studies included in our review not only by how they constrain their per-sample latent spaces but also by how they constrain their per-token latent spaces (Table 4). These groupings are also done at a greater granularity than the previously examined categories to offer more insight into which methods use which techniques. We see that not only are there more types of per-token latent space constraints leveraged (10 vs. 7), but also methods consistently leverage a much greater diversity of per-token constraints vs. per-sample constraints (1.45 per-token constraints per method vs. 0.58 per-sample constraints, on average). We can further see from Figure 1 that the citation volume for works in this space is also heavily concentrated around methods that first employ no per-sample PT objective, followed by methods that only impose shallow, explicit methods, which further establishes this research gap.

Constraints on L SI in our Framework
Formally, for L SI to be valid, there must exist a distance function d : Z × Z → R, a radius r ∈ R, and a loss value ℓ * ∈ R such that at any solution θ * for which L SI (θ * ) < ℓ * , the learned embeddings z i = f θ * (x i ) recover the graph G PT = (V, E) under a radius nearest-neighbor connectivity algorithm with distance function d and radius r; i.e., it must be the case that (x i , x j ) ∈ E if and only if d(z i , z j ) < r. Furthermore, for the particular graph G PT and latent space Z, the set of θ * such that L SI (θ * ) < ℓ * must be non-empty (i.e., such a solution must exist).
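The validity condition can be checked numerically for a given set of embeddings: build the radius nearest-neighbor graph over the embeddings and compare its edge set with that of G PT. The following is a minimal sketch under stated assumptions: Euclidean distance stands in for d, and the helper names `radius_graph` and `recovers_graph` are illustrative, not part of the original framework.

```python
import math
from itertools import combinations


def radius_graph(embeddings, r, d=math.dist):
    """Return the set of undirected edges (i, j), i < j, with d(z_i, z_j) < r."""
    return {
        (i, j)
        for i, j in combinations(range(len(embeddings)), 2)
        if d(embeddings[i], embeddings[j]) < r
    }


def recovers_graph(embeddings, edges, r, d=math.dist):
    """True if the radius-r graph over the embeddings equals the target edge set."""
    target = {tuple(sorted(e)) for e in edges}
    return radius_graph(embeddings, r, d) == target


# Toy embeddings: two tight pairs that are far apart from each other.
z = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.0, 5.1)]
ok = recovers_graph(z, [(0, 1), (2, 3)], r=1.0)   # graph of two disjoint edges
bad = recovers_graph(z, [(0, 2)], r=1.0)          # a graph these embeddings do not encode
```

Any other distance function with the same two-argument signature (for example a cosine distance) can be passed as `d`.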

Realizing Existing Methods in our Framework
Let X ∈ X N PT be the pre-training dataset throughout this section. In cases where we have some auxiliary information (e.g., supervised, per-sample, pre-training labels), they will be denoted by y ∈ Y N PT .
Methods with no per-sample objectives Naturally, we can realize any method that only employs a per-token pre-training objective within our framework simply by setting λ SI = 0. This realization is trivial and offers no insight into the suitability of these pre-training methods for downstream per-sample tasks.
Methods with a supervised, single-task per-sample objective (e.g., BERT [35])
A simple, single-task, per-sample, classification pre-training objective induces a geometric constraint in the output latent space on the basis of the inner product "distance" between samples of the same vs. different class labels. We can use this observation to realize a reduction from a valid SIPT objective to the original classification objective. In particular, we can introduce a graph G = ({x i ∈ X}, {(x i , x j )|y i = y j }) which consists of cliques corresponding to each unique label c ∈ Y. Then, leveraging any structure-preserving metric learning loss with a cosine distance objective will, at optimality, recover a solution that also satisfies the original classification objective, where we use centroids of the induced clique embeddings to represent class embeddings.
Methods with a supervised, multi-task per-sample objective (e.g., MT-DNN [42]) A slightly more complicated case is when methods employ a multi-task, per-sample classification objective. In this case, there are two ways to realize this task within the SIPT framework. First, we can simply transform the multi-task objective into a single-task objective by constructing a new label-space consisting of the Cartesian product of all label spaces for each task individually. This will greatly increase the number of "labels" in the task, but then the problem can be realized via a graph of disconnected cliques much like in the single-task setting.
However, there is another way to realize this objective in the SIPT framework. In particular, suppose our collection of tasks consists of k label spaces: Y = Y 1 × · · · × Y k . Then, we can construct a graph G = (V, E) such that:
1. the vertices consist of all pre-training samples x i as well as auxiliary nodes corresponding to each label c in each label space Y j , and
2. the edges contain links between each sample x i and the auxiliary node of each of its k labels.
Then, we can see that if we solve the SIPT problem under a structure-preserving metric learning loss, we will naturally have produced embeddings for each x i which are close (in inner-product distance space) to the class embeddings corresponding to their labels for each task, while they are also far from other, non-matching class embeddings, as desired. This second approach is more useful to us in considering the ramifications of this style of constraint because it enables us to make more rigid theoretical guarantees via the SIPT theory.
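The second, sample-to-label construction can be sketched in a few lines. This is an illustrative sketch only: the function name `sample_label_graph` and the tuple encoding of nodes (`("sample", i)` and `("label", task, c)`) are assumptions made for the example.

```python
def sample_label_graph(labels_per_task):
    """Build a graph linking each sample to an auxiliary node per task label.

    labels_per_task: list of k lists; labels_per_task[j][i] is the label of
    sample i under task j. Returns (vertices, edges).
    """
    n_samples = len(labels_per_task[0])
    sample_nodes = [("sample", i) for i in range(n_samples)]
    label_nodes = set()
    edges = set()
    for task, labels in enumerate(labels_per_task):
        for i, c in enumerate(labels):
            node = ("label", task, c)       # one auxiliary node per (task, label)
            label_nodes.add(node)
            edges.add((("sample", i), node))
    return sample_nodes + sorted(label_nodes), edges


# Three samples, two tasks (one with integer labels, one with string labels).
vertices, edges = sample_label_graph([[0, 1, 0], ["a", "a", "b"]])
```

Each sample ends up with exactly k edges, one per task, so samples are only connected to each other indirectly through shared label nodes.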
Methods with a contrastive per-sample objective (e.g., GraphCL [54]) It is challenging to realize contrastive learning approaches within the SIPT framework, but it is still possible. Here, we highlight two distinct types of contrastive learning approaches we can capture within SIPT: a noising/augmentation-based approach, in which sample embeddings are constrained to be similar to embeddings of noised versions of said samples; and a multi-modal (or multi-lingual) contrastive approach, in which there exists a 1:1 mapping between two different sub-modalities within X which is used to join those two modalities into a unified latent space (e.g. a model which constrains embeddings of English sentences to be close to embeddings of their French translations, but far from unrelated sentences).
To consider the augmentation/noising policy type first, let h : x i → x̃ i represent the noising transformation. Then, to build a SIPT model analogous to this model, we construct an augmented dataset consisting of all original data points alongside all possible transformed versions of the original data points under h: X̃ = X ∪ {h(x i )|x i ∈ X}, where h(x i ) ranges over all possible outputs of the transformation. Note that even in contexts where h is continuous (and thus has an infinite image), we can still construct this dataset in practice because training is only performed over a finite number of steps, meaning our augmented dataset X̃ need only be expanded to cover a finite number of augmentations. Then, the associated pre-training graph is simple; we use every sample in the augmented dataset X̃ as a vertex and connect any two samples if and only if one is a transformed version of the other. This forms a graph of many disconnected stars (one star for each original datapoint x i ), and thus it does not directly enforce any particular geometry via our current theory. However, in cases where dataset size is sufficiently large, h sufficiently expressive, and data density sufficiently high, the natural continuity of any neural network model will induce additional, auxiliary connections across these stars (if, for example, the noised versions of two distinct samples have a high probability of being very similar), which increases the depth of the geometric constraints enforced. Quantifying the exact parameters of these interactions, however, we leave to future work.
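The star-shaped augmentation graph can be illustrated with a small sketch. Everything named here is an assumption for illustration: `star_graph` draws a fixed number of augmentations per sample (standing in for the finite set seen during training), and `drop_one_char` is a toy noising function playing the role of h.

```python
import random


def star_graph(dataset, h, n_aug, seed=0):
    """Augment each sample n_aug times with h and link it to its augmentations.

    Returns (vertices, edges): vertices holds the originals followed by the
    augmented samples; edges are (original_index, augmented_index) pairs.
    """
    rng = random.Random(seed)
    vertices = list(dataset)
    edges = []
    for i, x in enumerate(dataset):
        for _ in range(n_aug):
            vertices.append(h(x, rng))
            edges.append((i, len(vertices) - 1))
    return vertices, edges


def drop_one_char(s, rng):
    """Toy noising transformation: delete one randomly chosen character."""
    k = rng.randrange(len(s))
    return s[:k] + s[k + 1:]


vertices, edges = star_graph(["abcd", "wxyz"], drop_one_char, n_aug=3)
```

Every edge touches exactly one original sample, so the graph is a union of stars with no edges between different stars.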
In the case of the multi-modal/multi-lingual contrastive alignment objective across k modalities, our setup is much simpler: we simply let G PT be a k-partite graph whose vertices consist of individual data points (across all modalities) and whose edges connect samples that compose a matching pair across modalities (e.g. edges link English sentences to their French translations). The extent to which this constrains the output geometry in practice, then, comes down to several questions: (1) Is the cross-modal alignment a one-to-one, one-to-many, or many-to-many alignment (which impacts the geometry of the resulting graph)? (2) How large and dense is the dataset (which impacts the extent to which additional, indirect edges will be induced due to continuity in practice)? (3) How do other pre-training objectives constrain the individual modalities separately? In a case where this graph is one-to-one, and no other constraints are induced in each modality separately, this objective will offer only minimal constraints, as the resulting graph will consist of many disconnected 2-cliques.

Multi-similarity loss
The multi-similarity loss, parametrized by w + , w − , and t, is given below:
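For reference, a standard form of the multi-similarity loss from the metric-learning literature, given here as an assumption and possibly differing in detail from the variant used in this work, is

$$\mathcal{L}_{MS} = \frac{1}{N}\sum_{i=1}^{N}\left[\frac{1}{w_+}\log\Big(1 + \sum_{j \in P_i} e^{-w_+ (S_{ij} - t)}\Big) + \frac{1}{w_-}\log\Big(1 + \sum_{j \in N_i} e^{\,w_- (S_{ij} - t)}\Big)\right],$$

where $S_{ij}$ is the cosine similarity between $z_i$ and $z_j$, $P_i$ is the set of samples linked to $x_i$ in G PT, and $N_i$ is the set of samples not linked to $x_i$. The sketch below implements this form without any pair-mining step; the default parameter values are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def multi_similarity_loss(z, edges, w_pos=2.0, w_neg=40.0, t=0.5):
    """Multi-similarity loss over embeddings z with positives given by edges.

    Assumed form (see the lead-in); w_pos, w_neg and t play the roles of
    w+, w- and t. Default values are illustrative only.
    """
    linked = {frozenset(e) for e in edges}
    n = len(z)
    total = 0.0
    for i in range(n):
        pos = neg = 0.0
        for j in range(n):
            if i == j:
                continue
            s = cosine(z[i], z[j])
            if frozenset((i, j)) in linked:
                pos += math.exp(-w_pos * (s - t))   # penalize dissimilar positives
            else:
                neg += math.exp(w_neg * (s - t))    # penalize similar negatives
        total += math.log1p(pos) / w_pos + math.log1p(neg) / w_neg
    return total / n


edges = [(0, 1), (2, 3)]
z_good = [(1.0, 0.0), (1.0, 0.01), (0.0, 1.0), (0.01, 1.0)]   # linked pairs aligned
z_bad = [(1.0, 0.0), (0.0, 1.0), (1.0, 0.01), (0.01, 1.0)]    # linked pairs orthogonal
loss_good = multi_similarity_loss(z_good, edges)
loss_bad = multi_similarity_loss(z_bad, edges)
```

Embeddings that place linked samples close together and unlinked samples far apart yield the lower loss, which is the behaviour the structure-inducing objective relies on.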

Contrastive loss
Our contrastive loss is modeled after [89]'s version. For this loss, we assume we are given the following mappings: 'pos', which maps x into a positive node (i.e., linked to x in G PT ), and 'neg', which maps x into a negative node (i.e., not linked to x in G PT ). The union of a seed minibatch B of points X B and its images under 'pos' and 'neg' mappings form a full minibatch. This loss is specified by the positive and negative margin parameters µ + and µ − as:
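One margin-based form consistent with this description, stated as an assumption rather than as the exact expression used here, is

$$\mathcal{L}_{C} = \frac{1}{|B|}\sum_{x \in X_B}\Big[\max\big(0,\ d(z_x, z_{\mathrm{pos}(x)}) - \mu_+\big) + \max\big(0,\ \mu_- - d(z_x, z_{\mathrm{neg}(x)})\big)\Big],$$

where $z_x = f_\theta(x)$ and $d$ is a distance in the latent space. The sketch below assumes Euclidean distance and represents the 'pos' and 'neg' mappings as dictionaries from seed indices to indices in the full minibatch; both choices are illustrative.

```python
import math


def contrastive_loss(z, seed, pos, neg, mu_pos=0.0, mu_neg=1.0):
    """Margin-based contrastive loss over a seed minibatch.

    z: embeddings of the full minibatch (seed points plus their images).
    seed: indices of the seed points; pos/neg: dicts mapping each seed index
    to the index of its positive / negative node.
    Assumed form: pull positives within mu_pos, push negatives beyond mu_neg.
    """
    total = 0.0
    for i in seed:
        d_pos = math.dist(z[i], z[pos[i]])
        d_neg = math.dist(z[i], z[neg[i]])
        total += max(0.0, d_pos - mu_pos) + max(0.0, mu_neg - d_neg)
    return total / len(seed)


z = [(0.0, 0.0), (0.0, 0.0), (3.0, 4.0)]
satisfied = contrastive_loss(z, seed=[0], pos={0: 1}, neg={0: 2})
violated = contrastive_loss(z, seed=[0], pos={0: 2}, neg={0: 1})
```

In the first call the positive coincides with the seed and the negative lies beyond the margin, so no penalty is incurred; swapping the roles incurs both penalties.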

Additional Choices within the SIPT Framework
In addition to a loss term, we can use negative sampling to improve efficiency. Using the full graph G PT , which is not available in many contexts where negative sampling is employed, we can leverage the distance between samples calculated on G PT , which provides a complementary source of information beyond embedding space distance alone. For example, one could use this to limit negative samples within the same connected component, but more complex strategies based on graph sampling (e.g. [95]) could also be used.
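As one illustration of graph-aware negative sampling, the sketch below draws negatives for a node only from its own connected component, and only among nodes at least a given number of hops away. The function names and the `min_hops` parameter are assumptions for this example, not part of the original method.

```python
import random
from collections import deque


def hop_distances(adj, source):
    """Breadth-first search: hop distance from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def sample_negative(adj, x, rng, min_hops=2):
    """Sample a negative for x from its connected component, excluding nodes
    closer than min_hops (so x itself and its direct neighbors are excluded)."""
    dist = hop_distances(adj, x)
    candidates = sorted(v for v, d in dist.items() if d >= min_hops)
    return rng.choice(candidates) if candidates else None


# A path 0-1-2 and a separate edge 3-4.
adj = {0: [1], 1: [0, 2], 2: [1], 3: [4], 4: [3]}
neg_for_0 = sample_negative(adj, 0, random.Random(0))
neg_for_3 = sample_negative(adj, 3, random.Random(0))
```

Returning `None` when a component has no eligible node leaves the fallback policy (for example, sampling uniformly from the rest of the dataset) to the caller.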

Proof of Theorem 1
We begin by defining the notion of "Local Consistency," which (informally) quantifies how "smooth" a given fine-tuning task label is over a graph G PT (Definition 5). In addition, note that throughout all proofs, we will assume that the PT and FT datasets are iid, that FT tasks, though they may be unobserved over PT samples, are well defined over the entire PT and FT domain and thus true labels do exist (though they may be unknown) for PT samples, and that the sampling distribution of the PT/FT data has full support over the label-space of any considered task.
Definition 5 (Local Consistency). Let y : X → Y be a task over a domain X, and let G = (V, E) be a graph such that X ⊆ V . The local consistency LC G (y) is the probability that a node's label y(x) agrees with the majority of labels of x's neighbors in G:
$$\mathrm{LC}_G(y) = \mathbb{P}_{x}\left( y(x) = \operatorname*{arg\,max}_{c \in \mathcal{Y}} \left|\{x' : (x, x') \in E,\ y(x') = c\}\right| \right).$$
Note this is closely related to homophily [96][97][98].
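An empirical estimate of local consistency on a finite graph can be sketched as follows. Two conventions here are assumptions made for the example: nodes without neighbors are skipped, and when several labels tie for the majority among a node's neighbors, the node counts as agreeing if its own label is one of them.

```python
from collections import Counter


def local_consistency(adj, y):
    """Fraction of nodes whose label matches the majority label of their neighbors.

    adj: dict mapping each node to a list of its neighbors.
    y: dict mapping each node to its label.
    """
    agree = counted = 0
    for x, neighbors in adj.items():
        if not neighbors:
            continue                      # majority undefined without neighbors
        counts = Counter(y[n] for n in neighbors)
        top = max(counts.values())
        majority = {c for c, k in counts.items() if k == top}
        agree += y[x] in majority
        counted += 1
    return agree / counted


# Star graph: centre 0 with leaves 1, 2, 3.
adj = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}
y = {0: "a", 1: "a", 2: "a", 3: "b"}
lc_star = local_consistency(adj, y)

# Path 0-1-2 with alternating labels.
lc_path = local_consistency({0: [1], 1: [0, 2], 2: [1]}, {0: "a", 1: "b", 2: "a"})
```

In the star, only the leaf labelled "b" disagrees with its neighborhood; on the alternating path, every node disagrees.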
With Local Consistency defined, we can now formally prove Theorem 1, reproduced below.
Theorem 1. Let X PT be a PT dataset, G PT be a PT graph, and let f θ * be an encoder pre-trained under a PT objective permissible under our framing that realizes an L SI value no more than ℓ * . Then, under embedder f θ * , the nearest-neighbor accuracy for a FT task y converges as dataset size increases to at least the local consistency (Definition 5) of y over G PT .
Proof. Given that f realizes SIPT-optimal embeddings, we know that if we define an r-NN predictor via the same radius r * at which f achieves optimality, then this predictor will be correct exactly as often as the label of a given node in the graph G PT agrees with the labels of its neighbors, which is LC G PT (y). However, this classifier may not be well defined for small FT dataset sizes: if data is not sufficiently dense, there may be no data points within the radius r * of a given query. Similarly, without sufficient PT data, the LC computed over the empirical distribution of the graph G PT may be a poor proxy for the true distribution. As PT and FT dataset sizes increase, however, we can achieve at least this performance. We may be able to achieve even higher performance if other effects motivate stronger performance at radii smaller than r * , but this is not guaranteed.
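The radius nearest-neighbor predictor used in this argument can be sketched as a majority vote over all labelled embeddings within radius r of the query. The sketch assumes Euclidean distance and returns `None` when no labelled point falls inside the radius, which is the ill-defined case that arises when FT data are too sparse.

```python
import math
from collections import Counter


def radius_nn_predict(query, train_z, train_y, r):
    """Majority label among training embeddings within distance r of query.

    Returns None when the radius-r ball around the query contains no
    training point (the predictor is undefined there).
    """
    votes = Counter(
        label for z, label in zip(train_z, train_y) if math.dist(query, z) < r
    )
    return votes.most_common(1)[0][0] if votes else None


train_z = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0)]
train_y = ["a", "a", "b", "b"]
dense = radius_nn_predict((0.05, 0.05), train_z, train_y, r=1.0)
sparse = radius_nn_predict((10.0, 10.0), train_z, train_y, r=1.0)
```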
Proof of Corollary 1
Corollary 1. Let X PT ∈ X N PT be a PT dataset with corresponding labels y ∈ Y N PT . Define G PT = (X PT , {(x i , x j )|y i = y j }). Then, the local consistency for a given FT task y (FT) over G PT (and thus by Theorem 1, the nearest-neighbor accuracy for any optimized SIPT embedder) is upper bounded by the probability that a sample x i 's fine-tuning label y (FT) i agrees with the majority class label for task y (FT) over the clique consisting of all nodes with the same pre-training label y i as x i .
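Proof. This follows directly from the definition of Local Consistency, G PT , and the law of total probability. In particular, since the neighbors of x i in G PT are exactly the other samples sharing its pre-training label, conditioning on that label gives
$$\mathrm{LC}_{G_{PT}}\big(y^{(FT)}\big) = \sum_{c \in \mathcal{Y}} \mathbb{P}(y_i = c)\, \mathbb{P}\big(y^{(FT)}_i = m_c \mid y_i = c\big),$$
where m c denotes the majority y (FT) label over the clique of samples with pre-training label c. With local consistency found, a simple application of Theorem 1 completes the proof.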
Note that this has a dependence on the PT dataset size, as the probabilities P are taken over the empirical distribution induced by the dataset X PT and graph G PT inherent in local consistency: if X PT is too small, these empirical distributions will be poor proxies for the true distribution and this bound will not hold tightly. However, once saturation is reached, it will not improve beyond this fixed upper bound relating to task correlation.

Proof of Corollary 2
Corollary 2. Let X PT be a PT dataset that can be realized over a valid manifold M. Assume X PT is sampled with full support over M. Let G PT = (X PT , E) be an r-nearest-neighbor graph over M (i.e., (x i , x j ) ∈ E if and only if the geodesic distance between the two points on M is less than r: D M (x i , x j ) < r). Let y (FT) be a FT classification task that is almost everywhere smooth on the manifold.
Then, as PT dataset size (and thus the size of G PT ) tends to ∞, and r tends to zero, the local consistency of y (FT) over G PT (and thus by Theorem 1 the nearest-neighbor accuracy of an SIPT embedder) will likewise tend to 1.
Proof. As r → 0, provided PT dataset size increases at a sufficient associated rate so as to maintain a constant minimum degree of G PT , the diameter of the region of M covered by a node's local neighborhood within G PT likewise shrinks. Given some fixed node x ∈ M that lies in the interior of a region of constant y FT label, this implies that, eventually, the neighborhood will grow sufficiently small that all of x's neighbors share the same label as x under y FT .
More concretely, this occurs once r falls below the geodesic distance between x and the boundary of the surrounding constant-label patch containing x. Hence the only sections of M in which neighborhoods around points are not constant w.r.t. y FT labels are, almost everywhere, patches within distance r of the points where y FT changes.
This implies that as r → 0, the neighborhood around a node x is constant w.r.t. y FT almost everywhere. This in turn implies that y FT displays perfect local consistency almost everywhere, as desired.

Semi-synthetic Experiments Validating Theoretical Results
We can further validate the theoretical analyses of our framework via semi-synthetic experiments. In particular, we create several datasets of natural language sentences augmented with synthetic graphs with known relationships to certain FT tasks (e.g., low or high local consistency, low or high rates of noise). We then use these datasets to validate three important properties of PT methods. First, do PT methods trained with an L SI and G PT yield nearest-neighbor FT performance in accordance with our theory? In particular, do (a) FT tasks with high local consistency over the PT graph offer better performance, and (b) those with very low local consistency offer worse performance? Second, do PT methods trained with an L SI and G PT suffer significantly when pre-training graphs are polluted with noise? Third, do the latent space geometry regularizing properties of L SI yield methods whose embeddings cluster more clearly than embeddings produced by traditional pre-training alone?
Topics were assigned to these sentences by running Latent Dirichlet Allocation via Scikit-learn [99] over a bag-of-words representation with 100 topics and otherwise default parameters. Given the probabilities over all 100 topics, we treated the prediction of the most probable topic as a 100-class multi-class classification problem for our FT task in these experiments.
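The topic-assignment step can be sketched with scikit-learn's `CountVectorizer` and `LatentDirichletAllocation`. This is a minimal sketch, not the original pipeline: the toy corpus, the function name `assign_topics`, and the fixed `random_state` (added only for reproducibility) are assumptions, and the toy call uses 3 topics instead of 100.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer


def assign_topics(sentences, n_topics=100, seed=0):
    """Fit LDA on a bag-of-words representation and label each sentence
    with its most probable topic.

    Returns (probs, labels): probs has shape (n_sentences, n_topics).
    """
    bow = CountVectorizer().fit_transform(sentences)
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=seed)
    probs = lda.fit_transform(bow)          # per-sentence topic distribution
    labels = probs.argmax(axis=1)           # most probable topic = FT label
    return probs, labels


toy_corpus = [
    "proteins fold into stable structures",
    "graphs contain nodes and edges",
    "sentences contain words and tokens",
    "protein structures determine function",
    "edges connect nodes in graphs",
    "tokens form words in sentences",
]
probs, labels = assign_topics(toy_corpus, n_topics=3)
```

The per-sentence topic probabilities returned here are also what the simplex localization for the manifold graphs operates on.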
To test across various graphs, we produce a number of pre-training graphs per experiment, as detailed below.
Pre-training graphs We use graphs spanning 3 categories. (1) A graph (CLIQUES) consisting of disconnected cliques, where sentences are linked in the graph if they share the same topic label. (2) Graphs composed of nearest-neighbor graphs defined over simplicial manifolds built using topic probabilities to localize sentences onto simplices. We explore manifolds with a range of topological complexity, including: PLANE, MÖBIUS, SPHERE, and TORUS. Finally, (3) we define three graphs according to a mechanistic process that allows us to control how topic labels relate to graph structure: first, so that topics are maximally conserved within local neighborhoods (NEIGHBORHOOD); second, by assigning sentences to nodes in the graph such that each graph motif corresponds to a unique topic (MOTIF); and third, such that node topics are driven by non-local graph structural features, on the basis of graphlet degree vectors (STRUCTURAL). Details for each pre-training graph formation are given below.

CLIQUES Graph Setup
To construct the Cliques graph setting, we choose a random subset of sentences as X PT and define G PT = (X PT , E) such that (x i , x j ) ∈ E if and only if x i and x j share the same topic label.
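The clique construction amounts to grouping sample indices by label and linking every pair within a group. A minimal sketch, with the function name `cliques_graph` chosen for illustration:

```python
from collections import defaultdict
from itertools import combinations


def cliques_graph(labels):
    """Edge set linking every pair of samples that share a label.

    labels: list where labels[i] is the topic label of sample i.
    Returns a set of undirected edges (i, j) with i < j.
    """
    groups = defaultdict(list)
    for i, c in enumerate(labels):
        groups[c].append(i)
    return {
        pair
        for members in groups.values()
        for pair in combinations(members, 2)
    }


edges = cliques_graph([0, 1, 0, 0, 1])
```

The number of edges grows quadratically with the size of the largest topic, which is worth keeping in mind when choosing the size of the random subset.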

PLANE, MÖBIUS, SPHERE, & TORUS Graphs
For these graphs, we follow a more involved procedure to localize sentences onto specifiable simplicial manifolds, then construct pre-training graphs via radius nearest-neighbor graphs on those manifolds. This involves several steps:
Localizing Sentences on Simplices We can localize any sentence in our overall dataset onto a 2-simplex by mapping it onto the (re-normalized) probabilities associated with its top-3 topics. Doing this means that the simplex on which a sentence is localized has vertices corresponding to three of our 100 total topics.
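The localization step can be sketched as selecting the three most probable topics and renormalizing their probabilities to obtain barycentric coordinates on the corresponding 2-simplex. The function name and the return format below are illustrative assumptions.

```python
def localize_on_simplex(topic_probs, k=3):
    """Map a topic distribution to barycentric coordinates over its top-k topics.

    Returns (topics, coords): topics is the sorted tuple of the k most probable
    topic indices (the simplex vertices) and coords their renormalized
    probabilities, which sum to 1.
    """
    ranked = sorted(range(len(topic_probs)), key=lambda t: topic_probs[t], reverse=True)
    topics = tuple(sorted(ranked[:k]))
    mass = sum(topic_probs[t] for t in topics)
    coords = tuple(topic_probs[t] / mass for t in topics)
    return topics, coords


topics, coords = localize_on_simplex([0.5, 0.1, 0.2, 0.15, 0.05])
```

Sentences that share the same three top topics land on the same simplex, which is what allows simplices to be stitched together along shared vertices later on.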
Stitching Topic-simplices Into Manifolds Given these topic-simplex localized sentences, we need to construct our manifolds. To do so, we first produce any arbitrary simplicial tiling of a 2-manifold. With this tiling, all that remains to localize sentences onto the manifold is to find a self-consistent mapping of topics to simplex vertices (in the tiling) such that all topic-simplices induced by this mapping have sufficiently many associated samples to enable roughly uniform sampling.

Sampling Points
After finding a self-consistent map of topics to simplicial tiling vertices that satisfy density requirements, we can sample sentences onto the manifold. To make this process more uniform, we also calculate the relative entropy of each sentence (over the renormalized probabilities of the top-3 topics), bin those entropies into buckets, then sample first what entropy bucket we wish to draw from such that the induced distribution of sentence entropies is approximately uniform, then sample within that entropy bucket.
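The two-stage sampling can be sketched as follows. Several details are assumptions made for the example: the entropy is normalized by its maximum (log 3 for three topics), buckets are of equal width, non-empty buckets are chosen uniformly, and sampling is with replacement.

```python
import math
import random


def normalized_entropy(p):
    """Shannon entropy of p divided by its maximum value, log(len(p))."""
    h = -sum(q * math.log(q) for q in p if q > 0)
    return h / math.log(len(p))


def sample_uniform_over_entropy(coords, n_samples, n_buckets=5, seed=0):
    """Sample sentence indices so that entropy buckets are hit about equally often.

    coords: list of renormalized top-3 topic probabilities, one per sentence.
    First a non-empty entropy bucket is chosen uniformly, then a sentence
    is chosen uniformly within that bucket.
    """
    rng = random.Random(seed)
    buckets = {}
    for idx, p in enumerate(coords):
        b = min(int(normalized_entropy(p) * n_buckets), n_buckets - 1)
        buckets.setdefault(b, []).append(idx)
    keys = sorted(buckets)
    return [rng.choice(buckets[rng.choice(keys)]) for _ in range(n_samples)]


# 90 near-vertex sentences (low entropy) and 10 near-centre sentences (high entropy).
coords = [(0.98, 0.01, 0.01)] * 90 + [(1 / 3, 1 / 3, 1 / 3)] * 10
draws = sample_uniform_over_entropy(coords, n_samples=2000)
high_entropy_fraction = sum(i >= 90 for i in draws) / len(draws)
```

Although high-entropy sentences make up only 10% of this toy pool, they account for roughly half of the draws, which is the flattening effect the bucketed procedure is meant to achieve.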
Calculating on-Manifold Distances Finally, with sentences sampled and localized onto a simplicial manifold, we then need to compute approximate geodesic distances to enable building radius-nearest-neighbor graphs over these sentences. To do so, we use an approximate algorithm that considers only on-simplex distance (e.g., it does not consider any curvature penalties) which is equivalent to calculating the distance between any pair of points over the simplices presuming they were flattened onto a plane (this flattening naturally does not preserve manifold topology, but along only the shortest path between any particular set of two points it is always possible to do so with a 2-manifold).
The above process describes how to produce a radius-nearest-neighbor graph for any specifiable manifold using our topic-model outputs. We do this for simplicial manifolds that correspond topologically to a simple plane (PLANE), a Möbius strip (MÖBIUS), a sphere (SPHERE), and a torus (TORUS).

STRUCTURAL, NEIGHBORHOOD & MOTIFS Graphs
In order to form these examples, we must (1) define our overall graphs, (2) featurize these graphs in a manner that is reflective of different forms of graph structure, then (3) use these featurizations to assign sentences to graph nodes to form our pre-training dataset.
Graph Construction We sample graphs by first building a base cycle of a parametrized size, then add motifs along this cycle by sampling small graphs from all possible connected graphs of size less than 6 nodes.
Node Featurization Nodes in this graph are then assigned internal features based on three notions of graph topology. For the "Neighborhood" label, a node n is identified according to an index-vector indicating which nodes in the graph are within shortest-path distance 3 of n.
For the "Motif" label, n is identified based on its membership either in the base cycle or any of the attached random subgraphs. For the "Structural" label, n is identified based on its graphlet degree vector (of order 4). For structural and homophily features, categorical labels are then produced by feeding these raw representations through a k-means clustering algorithm.

Sentence Assignment
We assign sentences to nodes in multiple ways so that we can produce datasets that reflect each of the notions of graph structure discussed previously. In particular, for either the neighborhood, motif, or structural labels, each sentence topic is matched to a node label, then sentences are assigned randomly to nodes in the graph with a matching topic label. Note that this produces a dataset where the graph structure is only partially reflected by the node's features, which is itself another useful test of the SIPT method, as it would not be useful if SIPT could only capture data in contexts where the graph was perfectly reflected by the node features themselves.

Expected local consistency between graphs G PT and the topic prediction FT task
Of all these graphs, we expect that topics will display a low local consistency over the STRUCTURAL graph and a moderately high local consistency over the MOTIF graph (as graph motifs are all connected components), and high local consistency everywhere else.

Network Architecture & Hyperparameters
The Cliques and Mechanistic experiments use a shallow Transformer model with 2 layers and 10 hidden units. The Manifold experiments use a 3-layer Transformer model with 256 hidden units. Hyperparameters were not tuned but were chosen by hand to produce as small a network as possible while permitting reasonable learning dynamics.
Experimental setup To answer our three questions, we will pre-train models under both traditional LM pre-training alone and a new, structure-inducing PT (SIPT) method within our paradigm that augments the loss with a contrastive learning loss over G PT , with λ SI = 0.1. Both models use a shallow transformer encoder for f θ and a character-level tokenization scheme. Final results are reported via the AUROC of 3-nearest-neighbor classifiers over the latent space, per-sample embeddings. In line with our theoretical predictions, we expect to see higher NN FT performance in all settings where the FT task (topic prediction) has high local consistency over the graph G PT (all graphs except STRUCTURAL) and worse performance in the case where the local consistency is very low (STRUCTURAL).
We also assess the stability of our method as the graph G PT is noised using the CLIQUES graph by randomly adding additional edges with varying rates.
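A simple way to noise a graph is to add randomly chosen non-edges. In the sketch below, the noise rate is interpreted as the number of added edges relative to the number of original edges; that interpretation, and the function name `add_noise_edges`, are assumptions for illustration.

```python
import random
from itertools import combinations


def add_noise_edges(n_nodes, edges, rate, seed=0):
    """Return the edge set with round(rate * |E|) random non-edges added.

    edges: iterable of undirected edges over nodes 0..n_nodes-1.
    rate: added edges as a fraction of the original edge count (assumption).
    """
    rng = random.Random(seed)
    existing = {tuple(sorted(e)) for e in edges}
    non_edges = [p for p in combinations(range(n_nodes), 2) if p not in existing]
    n_new = min(len(non_edges), round(rate * len(existing)))
    return existing | set(rng.sample(non_edges, n_new))


# Two disjoint triangles (cliques), noised at a 50% rate.
clean = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5)]
noisy = add_noise_edges(6, clean, rate=0.5)
```

Because the original graph consists of disconnected cliques, every added edge necessarily links two different cliques and therefore lowers local consistency.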
Semi-synthetic Result 1: SIPT improves performance over LM PT by 0.26 ± 0.13 AUROC on graphs where the topic task has a high local consistency As can be seen in Figure 5a, SIPT offers significant improvements over LM PT in nearest-neighbor FT AUROC across all graph types with strong topic local consistency.
Semi-synthetic Result 2: SIPT's empirical results are in agreement with theoretical findings In line with our theoretical findings, SIPT only under-performs LM PT on the STRUCTURAL graph, where the topic task (by design) does not have strong local consistency. This validates our theoretical results by showing that local consistency strongly predicts nearest-neighbor FT performance. Figure 5b shows nearest-neighbor FT AUROC as a function of noise rate on the CLIQUES graph. For up to 15% noise, SIPT shows improvements over LM PT, and even at 50% noise, the two approaches perform comparably. Figure 5c-d shows embeddings produced under the MÖBIUS graph by either LM PT or SIPT, projected via UMAP into 2 dimensions. These figures show visually that SIPT embeddings form clear clusters strongly associated with the topic-modelling FT task, whereas LM PT embeddings do not.

Conclusions
From these analyses, we see that augmenting PT with per-sample structure-inducing objectives can both (1) offer significant advantages over existing PT architectures and (2) permit analytical reasoning about which FT tasks PT will offer improvements. These findings are not surprising; in these semi-synthetic experiments, we designed our graphs explicitly to have either high or low local consistency with respect to our FT task so that we could probe exactly whether SIPT methods would behave in accordance with theory in tightly controlled settings. In this way, the graphs G PT used here may not be reflective of graphs in the real world, which will be chosen more independently of specific FT tasks. To address this, in the Results section, we demonstrate experimental results over diverse real-world datasets with real, FT-task-independent graphs to show that the gains persist in more realistic scenarios.

Further Details on Real-world Experiments
Further Details on the PROTEINS Dataset and FT tasks
PT Dataset We use a dataset of ∼1.5M protein sequences from the Stanford Tree-of-life dataset [74] (https://snap.stanford.edu/tree-of-life/data.html). The associated Github repository for this resource lists an MIT license.
PT Graph Two proteins are linked in G PT if and only if they are documented in the scientific literature to interact, according to the tree-of-life interaction dataset. This is an external knowledge graph.
FT Dataset/Tasks We use the TAPE FT benchmark tasks [15], including Remote homology (RH), a per-sequence classification task to predict protein fold category (metric: accuracy); Secondary structure (SS), a per-token classification task to predict amino acid structural properties (metric: accuracy); Stability (ST) & Fluorescence (FL), per-sequence, regression tasks to predict a protein's stability and fluorescence, respectively (metric: Spearman's ρ); and Contact prediction (CP), an intra-sequence classification task to predict which pairs of amino acids are in contact in the protein's 3D conformation (metric: Precision at L/5).
Baselines We compare against the published TAPE model [15], which uses an LM task alone as our per-token comparison point, and the PLUS [45] model, which optimizes for LM and supervised classification jointly, for our per-sample comparison point.
The tasks in the TAPE benchmark [15] on which we test are described more fully below. All these datasets are publicly available. All datasets can be obtained directly on TAPE's Github (https://github.com/songlab-cal/tape#data), which lists no licenses for these datasets though the overall Github is released under a BSD 3-Clause "New" or "Revised" License.
Remote Homology This is a per-sequence, multi-class classification problem, evaluated using accuracy, which tasks a model to predict a protein fold category at a per-sequence level. This task's dataset contains 12,312/736/718 train/val/test proteins and is originally sourced from [100].
Secondary Structure This is a per-token, multi-class classification problem, evaluated using accuracy, which tasks a model to predict the structural properties of each amino acid in the final, folded protein. This task's dataset contains 8,678/2,170/513 train/val/test proteins, and is originally sourced from [101].
Stability This is a per-sequence, continuous regression problem evaluated using the Spearman correlation coefficient, which tasks a model to predict the protein's stability in response to environmental conditions. This task's dataset contains 53,679/2,447/12,839 train/val/test proteins, and is originally sourced from [102].
Fluorescence This is a per-sequence, continuous regression problem evaluated using the Spearman correlation coefficient, which tasks a model to predict how brightly a protein will fluoresce. This task's dataset contains 21,446/5,362/27,217 train/val/test proteins, and is originally sourced from [103].
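As an illustration of the contact prediction metric mentioned above (Precision at L/5), the following is a minimal sketch in plain Python. The function name `precision_at_l_over_k`, the toy inputs, and the minimum sequence separation `min_sep=6` are assumptions made for this example and are not taken from the TAPE benchmark code.

```python
def precision_at_l_over_k(scores, contacts, k=5, min_sep=6):
    """Precision of the top L/k scored residue pairs for one protein.

    scores[i][j]   : predicted contact score for residues i and j (i < j)
    contacts[i][j] : True if residues i and j are in contact
    min_sep        : minimum sequence separation |i - j| (assumed value)
    """
    length = len(scores)
    # Candidate pairs (i < j) that are far enough apart along the sequence.
    candidates = [
        (scores[i][j], i, j)
        for i in range(length)
        for j in range(i + min_sep, length)
    ]
    # Highest-scoring pairs first; keep the top L/k of them (at least one).
    candidates.sort(key=lambda item: item[0], reverse=True)
    top = candidates[: max(1, length // k)]
    if not top:
        return 0.0
    hits = sum(1 for _, i, j in top if contacts[i][j])
    return hits / len(top)


# Toy protein of length L = 10, so L/5 = 2 pairs are evaluated.
L = 10
scores = [[0.0] * L for _ in range(L)]
contacts = [[False] * L for _ in range(L)]
scores[0][9], contacts[0][9] = 0.9, True    # top prediction, a true contact
scores[1][8], contacts[1][8] = 0.8, False   # second prediction, not a contact
scores[0][7], contacts[0][7] = 0.7, True    # third prediction, outside the top L/5
example_precision = precision_at_l_over_k(scores, contacts)
```

In this toy case one of the two top-ranked pairs is a true contact, so the metric is one half.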

Further Details on the ABSTRACTS Dataset and FT tasks
PT Dataset We use a dataset of ∼650K free-text scientific article abstracts from the Microsoft Academic Graph (MAG) dataset [75,76]. The ABSTRACTS PT data (the Microsoft Academic Graph dataset) is released under the Open Data Commons Attribution License (ODC-By) v1.0.
PT Graph Two abstracts are linked in G PT if and only if their corresponding papers cite one another. This is a self-supervised graph.
FT Dataset/Task We use a subset of the fine-tuning tasks used in the SciBERT paper [91], including Paper field (PF), SciCite (SC), ACL-ARC (AA), and SciERC Relation Extraction (SRE), all of which are per-sentence classification problems (metric: Macro-F1). PF tasks models to predict a paper's area of study from its title, SC & AA tasks both predict an "intent" label for citations, and SRE is a relation extraction task.
Baseline We compare against the published SciBERT model [91] as our per-token comparison and lack an associated per-sample comparison as we don't know of any published per-sample models in the academic papers modality.
The tasks in the SciBERT benchmark [91] on which we test are described more fully below. All tasks here are per-sentence, multi-class classification problems (i.e., we do not study any per-token tasks), and all are evaluated in Macro-F1 (out of 1). All FT datasets can be obtained from the SciBERT GitHub repository (https://github.com/allenai/scibert), which lists no dataset-specific licenses but is released with an Apache-2.0 license.
Paper Field This problem asks models to predict a paper's area of study given its title. This task's dataset contains 84,000/5,599/22,399 train/val/test sentences. Though the original dataset is derived from the MAG [75], it was formulated into this task format by SciBERT directly [91].
SciCite This problem tasks models to predict an "intent" label for sentences that cite other scientific works within academic articles. This task's dataset contains 7,320/916/1,861 train/val/test sentences, and is originally sourced from [104].
ACL-ARC This problem tasks models to predict an "intent" label for sentences that cite other scientific works within academic articles. This task's dataset contains 1,688/114/139 train/val/test sentences and is originally sourced from [105].
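Since every ABSTRACTS task above is scored by Macro-F1, a small from-scratch sketch of that metric may be useful. The function name `macro_f1` and the toy label lists are illustrative assumptions; this is not the evaluation code used in the experiments.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores over all observed labels."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    f1_scores = []
    for label in labels:
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Toy citation-intent labels (illustrative only).
y_true = ["background", "method", "result", "method"]
y_pred = ["background", "method", "method", "method"]
example_macro_f1 = macro_f1(y_true, y_pred)  # (1.0 + 0.8 + 0.0) / 3
```

Because each class contributes equally regardless of its frequency, a rare class that is never predicted correctly (here "result") lowers the score substantially.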

Further Details on the NETWORKS Dataset and FT tasks
PT Dataset We use a dataset of ∼70K protein-protein interaction (PPI) ego-networks here, sourced from [43]. Each individual sample here describes a single protein, realized as a biological network (i.e., an attributed graph) corresponding to the ego-network about that protein (i.e., a small subgraph containing all nodes within the target protein) in a broader PPI graph. Unlike our other domains, this domain does not contain sequences. The NETWORKS PT dataset releases its code and dataset files under an MIT license.
PT Graph The dataset from [43] is labeled with the presence or absence of any of 4000 protein gene ontology terms associated with the central protein in each PPI ego network. Leveraging these labels, two PPI ego-networks are linked in G PT if and only if the Hamming distance between their observed label vectors is no more than 9. This is an alternate-representation nearest-neighbor graph.
FT Dataset/Tasks Our FT task is the multi-label binary classification of the 40 gene-ontology term annotations (metric: macro-AUROC) used in [43]. We use the PT set for FT training and evaluate the model on a held-out random 10% split.
Baselines We compare against both attribute-masking [43] and multi-task supervised PT.
The NETWORKS FT task is a multi-task, binary classification task. Recall that the dataset here consists of PPI ego-networks, which means that an individual sample input to the model is an attributed graph x which contains a central node, corresponding to a protein, along with the ego-graph surrounding that node in a larger PPI graph. This ego-graph can thus be seen to correspond to the central protein, and the FT and PT tasks leverage this association, as both flag whether or not that central protein is associated with particular gene-ontology (GO) terms (annotations relating to protein properties or function applied in the literature). The PT tasks contain 4000 possible GO annotations, but the FT tasks correspond to a smaller set of only 40 GO terms, chosen as they were of greater interest than the full set. See the original source ([43]) for more information and full details.
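The NETWORKS PT graph described above links two ego-networks when the Hamming distance between their label vectors is at most 9. A minimal sketch of that rule follows; the helper names `hamming_distance` and `build_pt_graph` and the short toy vectors are assumptions for illustration. The brute-force pairwise loop is only practical for small examples, and a dataset of ∼70K samples would need a more efficient neighbour search.

```python
from itertools import combinations


def hamming_distance(a, b):
    """Number of positions at which two equal-length label vectors differ."""
    return sum(x != y for x, y in zip(a, b))


def build_pt_graph(label_vectors, max_distance=9):
    """Return the edge set {(i, j), i < j} of samples within max_distance."""
    edges = set()
    for i, j in combinations(range(len(label_vectors)), 2):
        if hamming_distance(label_vectors[i], label_vectors[j]) <= max_distance:
            edges.add((i, j))
    return edges


# Toy binary label vectors (12 terms here instead of the 4000 GO terms).
label_vectors = [
    [0] * 12,            # sample 0
    [1] * 9 + [0] * 3,   # sample 1: distance 9 from sample 0, 3 from sample 2
    [1] * 12,            # sample 2: distance 12 from sample 0
]
example_edges = build_pt_graph(label_vectors)
```

Here samples 0 and 1 are linked (distance exactly 9), samples 1 and 2 are linked (distance 3), and samples 0 and 2 are not (distance 12).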

Further Details on Experimental Procedure
To minimize computational burden, we do not pre-train a structure-inducing model from scratch for PROTEINS and ABSTRACTS datasets. Instead, we initialize a model from the per-token baseline directly, then perform additional pre-training for only a small number of epochs under the new SIPT loss subdivision. We assess both multi-similarity and contrastive L SI variants in these domains. On the NETWORKS dataset, we pre-train all models (including baselines) from scratch, and based on early experimental results, we only assess the contrastive loss variant.

Further Details on Ablation Studies
Note that the warm-start procedure described above on the PROTEINS and ABSTRACTS domains allows a powerful ablation study: by additionally training a PT model from the per-token baseline with λ SI = 0, we can uniquely assess the impact of the new loss term, rather than simply additional training or the different PT dataset. We perform this ablation study for all applicable datasets. For the NETWORKS dataset, no additional ablation studies are needed to assess the impact of the loss term, given all models are trained from scratch with the same early-stop procedures.
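To make the role of λ SI in this ablation concrete, the sketch below combines a language modelling loss with a structure-inducing term. Both the convex weighting (1 − λ SI) · L LM + λ SI · L SI and the textbook pairwise contrastive term are assumptions chosen for illustration, not necessarily the exact losses used in the experiments; the function names and toy embeddings are likewise hypothetical.

```python
import math


def contrastive_si_loss(embeddings, edges, margin=1.0):
    """Generic pairwise contrastive loss over all sample pairs (assumed form).

    Linked pairs are pulled together (squared distance); unlinked pairs are
    pushed at least `margin` apart (squared hinge on the distance).
    """
    linked = {frozenset(edge) for edge in edges}
    nodes = list(embeddings)
    total, count = 0.0, 0
    for idx, u in enumerate(nodes):
        for v in nodes[idx + 1:]:
            d = math.dist(embeddings[u], embeddings[v])
            if frozenset((u, v)) in linked:
                total += d ** 2
            else:
                total += max(0.0, margin - d) ** 2
            count += 1
    return total / count if count else 0.0


def pretraining_loss(lm_loss, si_loss, lambda_si):
    """Assumed convex combination of the LM and structure-inducing losses."""
    return (1.0 - lambda_si) * lm_loss + lambda_si * si_loss


# Toy per-sample embeddings and a PT graph with a single edge x -- y.
embeddings = {"x": (0.0, 0.0), "y": (3.0, 4.0), "z": (0.0, 0.5)}
si = contrastive_si_loss(embeddings, edges=[("x", "y")])
ablation = pretraining_loss(lm_loss=2.0, si_loss=si, lambda_si=0.0)
weighted = pretraining_loss(lm_loss=2.0, si_loss=4.0, lambda_si=0.1)
```

Under this assumed weighting, setting λ SI = 0 leaves only the language modelling loss, which is why the ablation isolates the effect of the structure-inducing term from that of additional training.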
Further Details on Choosing λ SI
For the PROTEINS and ABSTRACTS datasets, to choose the optimal value of λ SI for use at PT time, we pre-trained several models and evaluated their efficacy in a link retrieval task on G PT = (V, E). In particular, we score a node embedder f by embedding all nodes n ∈ V as f(n), then, for each node n, rank all other nodes n′ by the Euclidean distance between f(n) and f(n′), and assess this ranked list via IR metrics including label ranking average precision (LRAP), normalized discounted cumulative gain (nDCG), average precision (AP), and mean reciprocal rank (MRR), where a node n′ is deemed to be a "successful" retrieval for n if (n, n′) ∈ E. In this way, note that we choose λ SI in a manner that is independent of the fine-tuning task and can be determined solely based on the PT data. Final results for these experiments are shown in Methods Table 9 for the proteins dataset and Methods Table 10 for scientific articles.
Ultimately, this process suggests that λ SI of 0.1 is a robust setting, and as such, 0.1 was used directly for the NETWORKS task without further optimization.
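The link-retrieval evaluation described above can be sketched as follows for two of the listed metrics, MRR and average precision. The function names, the choice to skip nodes with no neighbours, and the toy embeddings are assumptions made for this example rather than details of the original evaluation code.

```python
import math


def rank_by_distance(embeddings, query):
    """All other node ids, sorted by Euclidean distance to `query`."""
    others = [n for n in embeddings if n != query]
    return sorted(others, key=lambda n: math.dist(embeddings[query], embeddings[n]))


def link_retrieval_scores(embeddings, edges):
    """Return (MRR, mean AP) over query nodes that have at least one edge."""
    neighbours = {n: set() for n in embeddings}
    for u, v in edges:
        neighbours[u].add(v)
        neighbours[v].add(u)
    reciprocal_ranks, average_precisions = [], []
    for query, relevant in neighbours.items():
        if not relevant:
            continue  # assumed: nodes without links are not scored
        hits, precisions, first_hit = 0, [], None
        for rank, node in enumerate(rank_by_distance(embeddings, query), start=1):
            if node in relevant:
                hits += 1
                precisions.append(hits / rank)
                if first_hit is None:
                    first_hit = rank
        reciprocal_ranks.append(1.0 / first_hit)
        average_precisions.append(sum(precisions) / len(relevant))
    if not reciprocal_ranks:
        raise ValueError("the graph has no edges to retrieve")
    n_queries = len(reciprocal_ranks)
    return sum(reciprocal_ranks) / n_queries, sum(average_precisions) / n_queries


# Toy embeddings f(n) on a line and a small PT graph.
embeddings = {"a": (0.0, 0.0), "b": (1.0, 0.0), "c": (5.0, 0.0), "d": (6.0, 0.0)}
edges = [("a", "b"), ("a", "d")]
mrr, mean_ap = link_retrieval_scores(embeddings, edges)
```

Because the score depends only on the embeddings and the edges of G PT, it can be computed for each candidate λ SI without touching any fine-tuning data.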

Further Details on Architecture & Hyperparameters
The architectures of our encoders for the PROTEINS and ABSTRACTS domains are fully determined from our source models in TAPE [15] and SciBERT [91]. In particular, for proteins and scientific articles, we use a 12-layer Transformer with a hidden size of 768, an intermediate size of 3072, and 12 attention heads. The tokenizers provided with TAPE and SciBERT are also used. A single linear layer to the output dimensionality of each task is used as the prediction head, taking as input the output of the final layer's [CLS] token as a whole-sequence embedding. We also tested pre-training for either one or four additional epochs and, based on validation set performance, ultimately used a single epoch for proteins and four for scientific articles.
For the NETWORKS domain, we match the architecture used in the original source [43] for the mask model runs, save that, for computational efficiency, we scale the batch size up as high as it can go, then proportionally scale up the learning rate to account for the larger batch size. This corresponds to a batch size of 1024, a learning rate of 0.01, a GCNN encoder type of GIN, embedding dimensions of 300, 5 layers, 10% dropout, mean pooling, and a JK strategy of "last".
Fine-tuning hyperparameters (learning rate, batch size, and the number of epochs) were determined based on a combination of existing results, hyperparameter tuning, and machine limitations. On proteins, most hyperparameters were set to follow those reported for an LM PT model in [106], though additional limited hyperparameter searches were performed to validate that these choices were adequate. As the original source for these hyperparameters was an LM PT model, any bias here should be against SIPT, meaning this is a conservative choice. Early stopping (based on the number of epochs without observing improvement in the validation set performance) was employed, and batch size was set as large as possible given the limitations of the underlying machine. For the PLUS reproduction, we compared hyperparameters analogous to those reported for PLUS on other tasks with hyperparameters analogous to ours on other tasks, and used those that performed best on the validation set. For scientific articles, we performed a grid search to optimize downstream task performance on the validation set, with the learning rate varying between 5e-6 and 5e-5 and the number of epochs between 2 and 5. The same grid search was used in the original SciBERT method. We additionally match the SciBERT benchmark by applying a dropout of 0.1, using the Adam optimizer with linear warm-up and decay, a batch size of 32, and no early stopping. For the NETWORKS domain, FT hyperparameters were again chosen to match the original source model [43], save for the increase in batch size and learning rate. No additional hyperparameter search was performed.
Final hyperparameters for each downstream task are shown in Table 5 for proteins and Table 6 for scientific articles.

Further Details on Implementation and Compute Environment
We leverage PyTorch for our codebase. FT experiments and NETWORKS PT were run over various Ubuntu machines (versions ranged from 16.04 to 20.04) with a variety of NVIDIA GPUs. PROTEINS and ABSTRACTS PT runs were performed on a Power 9 system, each run using 4 NVIDIA 32 GB V100 GPUs with InfiniBand at half precision.

Full Results
Here we provide the raw FT results for all tasks in the PROTEINS and ABSTRACTS domains, respectively (Tables 7 and 8). The NETWORKS domain raw results are already present in the main text (Figure 3).

SIPT Results are in Accordance with Theory and Guiding Hypothesis
Results over all real-world domains are consistent with our theoretical analyses and guiding hypothesis. We can also analyze the extent to which induced structure helps non-NLP domains by examining the results of our λ SI tuning procedure. In particular, we find that far less structure-inducing is necessary on our ABSTRACTS dataset (λ SI = 0.01) than on our PROTEINS dataset (λ SI = 0.1). This agrees with our guiding hypothesis that per-sample latent space regularization is much more necessary on non-NLP domains than on NLP domains.
To demonstrate this, we show the final results for the guiding link-retrieval task for the PROTEINS domain in Table 9 and for the ABSTRACTS domain in Table 10. In both settings, we compare the following models.
Random Nodes are embedded with random vectors to assess chance performance.
Initial Model Nodes are embedded with the base pre-trained model we build on in our experiments without further modifications. This model is TAPE [15] for proteins and SciBERT [91] for scientific articles.
LM PT Nodes are embedded with the final encoder after additional pre-training on our graph-augmented datasets, but without any SIPT (i.e., λ SI = 0).

CS RoBERTa (for scientific articles only) Nodes are embedded via [12]'s DAPT CS RoBERTa model, which is another LM PT model over scientific abstracts which performed very well on ACL-ARC, the task on which SIPT does best in scientific articles.
SIPT (for various values of λ SI ). Nodes are represented via SIPT PT models at the specified weighting. For proteins, all SIPT models are initialized from TAPE, but for scientific articles, we test against both initializing from SciBERT and CS RoBERTa (as both are just different, domain-specific LM PT models).
Note that in addition to the discrepancy in the magnitude of improvement (over scientific articles, average precision goes from 12.9% to 14.2%, vs. 2.4% to 3.5% on proteins, which is proportionally much more significant), we can also see that SIPT improves retrieval performance over the baselines for proteins much more than it does for scientific articles. This is, admittedly, largely due to [12]'s CS RoBERTa model's surprisingly good performance without any modifications; however, as we also compare SIPT pre-trained from a CS RoBERTa model and it does not demonstrate significant improvements, we still feel this is a fair comparison. These findings are consistent with our hypothesis that SIPT will offer more significant advantages in non-natural language domains.
Table 9: PT set link-retrieval performance for a random baseline, the raw TAPE model, and SIPT for various weighting parameters λ SI on the dataset of protein sequences. LRAP, label ranking average precision; nDCG, normalized discounted cumulative gain; AP, average precision; MRR, mean reciprocal rank. Higher values indicate better performance. Highlighted in grey are realizations of SIPT framework that yield better results than the strongest baseline, providing evidence that incorporating sequence-level relational information into PT (i.e., λ SI > 0) leads to improved performance.
Table 10: PT set link-retrieval performance for a random baseline, the raw SciBERT model, and SIPT for various weighting parameters λ SI on the scientific articles dataset. LRAP, label ranking average precision; nDCG, normalized discounted cumulative gain; AP, average precision; MRR, mean reciprocal rank. Higher values indicate better performance. Highlighted in grey are realizations of SIPT framework that yield better results than the strongest baseline, providing evidence that incorporating sequence-level relational information into PT (i.e., λ SI > 0) leads to improved performance.
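The statement that the proteins improvement is "proportionally much more significant" can be checked with a short relative-gain calculation on the average precision values quoted above. The helper name `relative_gain` is an assumption for this example.

```python
def relative_gain(before, after):
    """Improvement expressed as a fraction of the starting value."""
    return (after - before) / before


# Average precision (in %) before and after SIPT, as quoted in the text.
abstracts_gain = relative_gain(12.9, 14.2)  # roughly a 10% relative gain
proteins_gain = relative_gain(2.4, 3.5)     # roughly a 46% relative gain
```

Although the absolute gains are similar (1.3 vs. 1.1 percentage points), the relative gain on proteins is more than four times that on scientific articles.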

A Review of Language Model Pre-training Methods
In this supplementary section, we describe all of the models featured in our review (Figure 1 and Table 4) and highlight key details of their approach.

A.1 Language modelling alone
[1] General domain NLP; ELMo leverages a biLSTM to perform language modelling; unlike later methods, for FT tasks, models do not typically re-train the entire LSTM but rather use a weighted combination of model interior hidden states as (at FT time) static word embeddings.
[4] General domain NLP; RoBERTa includes only a masked language modelling objective.
[2, 5, 6] General domain NLP; The GPT series of models use autoregressive language modelling alone and focus on generative language tasks, not general PT/FT, though GPT-3 does show that by reframing many classical NLP fine-tuning tasks as generative language tasks, GPT-3 can still offer a compelling zero- and few-shot solution to these tasks using only the pretrained embedder [2].
[7] General domain NLP; BART utilizes a denoising language-model objective across various noising constraints.
[11] General domain NLP; UniLM integrates several different kinds of language modelling, including bidirectional, unidirectional, and sequence-to-sequence LMs. They impose no other PT losses.
[15, 33, 34] Protein sequences; Various methods have explored language modelling alone for protein sequences. One notable entry is the TAPE benchmark, which also introduces a public benchmark of FT tasks for future comparisons.
[26] Molecular Graphs; Molecular Graph BERT (MG-BERT; no relation to MG-BERT [31]) uses masked atom prediction to pre-train a GNN over molecular graphs.
[8] General domain NLP; This paper pre-trains a model for multi-lingual language modelling, using only a multi-lingual masked language modelling objective.
[12] General domain NLP; DAPT advocates for continual pre-training on increasingly task-focused text to improve its relevance to various downstream tasks. DAPT uses a RoBERTa baseline pre-training model, which includes only a masked language modelling objective. It shows significant gains after adaptation. However, as they only adapt the pre-training context to the more focused text, this induces no additional constraints on the latent space geometry.
[10] General domain NLP; SpanBERT changes the traditional masked language modelling task to a task in which contiguous spans are masked wholesale, rather than individual tokens.

A.2 Language modelling & templated tasks/prompting as language modelling
[3] General domain NLP; T5 not only performs a robust analysis of various existing pre-training strategies but also introduces the "text-to-text" style of diverse pre-training, in which various downstream NLP tasks can be re-realized as language modelling tasks via templating and prompting, then integrated into language model pre-training alongside unsupervised objectives (such as traditional masked language modelling, albeit realized as a sequence-to-sequence task). As they realize all these downstream tasks as additional language modelling tasks, they neither officially produce a directly constrained per-sample embedding nor constrain the geometry of Z beyond traditional masked language modelling.
[24] General domain NLP; CALM builds on ideas from T5 to propose a text-to-text pre-training objective that leverages recognized per-token KG entities from the source text as a generative prompt.
[17] General domain NLP; T0pp extends the architecture of T5 [3] to ingest templated language modelling tasks from a wide variety of possible input tasks, then evaluates its performance in a zero-shot manner on unseen fine-tuning tasks.

A.3 Language modelling & Per-token KG Integration
[13] General domain NLP; ERNIE 1 augments traditional MLM with entity-specific masking (e.g., masking the word "Mozart" from the sentence "Mozart was a musician") to force the model to recover common-sense knowledge about named entities.
[28] General domain NLP; KgPLM adapts the discriminative training ideas of ELECTRA [9] alongside the idea of entity masking explored previously. They combine entity masking with a discriminative loss that identifies which tokens were replaced, focused on entity replacements.
[22] General domain NLP; ERICA presents a mechanism for leveraging contrastive learning and distant supervision to incorporate external knowledge into a PLM for improving language understanding. ERICA augments MLM with two per-token tasks to ensure the per-token representations within a document reflect the structure of the KG. First, ERICA ensures that the pooled representations of head and tail entities are similar when conditioned on a relation (which is prepended to the document prior to embedding). Second, ERICA ensures that relation embeddings (defined as concatenated head, tail per-token entity embeddings) are similar within and across documents. As both tasks are done on per-token embeddings and never at a per-sample level, this approach induces minimal constraints on the per-sample latent space.
[14] General domain NLP; Know-BERT integrates per-token entity information into an MLM pre-training scheme by performing unconstrained attention over a per-entity knowledge graph (only on pre-identified candidate entity spans), alongside any available entity linking supervision information via direct Named Entity Linking. This has similarities with [25] and [107].
[25] Biomedical domain NLP; KeBioLM integrates a per-token KG into a biomedical language model by augmenting token entity representations with attention lookups into a biomedical KG (regardless of whether the attended entities match a given entity mention in the source text, though they do only apply this on recognized entities). To ensure this attention is meaningful, they perform named entity linking and recognition as auxiliary PT objectives, leveraging the same KG embeddings used during the attention calculation. In doing so, the method incentivizes per-token representations to be similar to their associated entity representations, thus ensuring that the entities are reflected in the attention over the KG. KG embeddings are initialized using Trans-E [108]. Their usage of automatically attending over entities within their language model (without explicit constraints on those matches) is motivated by [107] and has similarities to Know-BERT [14].
[16] General domain NLP; LUKE performs pre-training using MLM and an entity-specific masking/recognition scheme that is a slight variation on the traditional entity-specific masking that [13] proposed. At FT time, they have other knowledge-specific integrations, including specialized query matrices in KQV attention based on attending to either traditional tokens or entities. However, at PT time, LUKE's only modulation over a RoBERTa [4] baseline is an entity masking task.
[20] General domain NLP; COLAKE performs a priori entity linking on the source text, then replaces per-token mentions with entity embeddings, and appends to the input text sub-graphs from a (relational) knowledge graph, including both neighboring mentions and relations in the augmented input text block. This input is then encoded via a transformer that limits attention flow between tokens of different types and trains the entire ensemble with masked language, entity, and relation modelling.
[18] General domain NLP; In this paper, traditional masked language modelling is augmented with an entity-replacement-detection task. Named entity recognition and linking are performed before pre-training, and entity replacements are constrained to be the same type as the true entity.
[30] Knowledge Graph Completion; LP-BERT constructs a specialized pre-training corpus consisting of entity-relation statements from a knowledge graph. This is used in a pre-training context under three pre-training tasks: masked language modelling, masked entity modelling, and masked relationship modelling. All three are per-token, and no per-sample tasks are used at pre-training time.
[32] Multilingual Language Models; UD-PrLM examines multilingual pre-training, and aims to improve it by incorporating universal dependency parse trees into the model. They incorporate a per-token task to align tokens with identified dependency parse tree components, alongside masked language modelling.

A.4 Language modelling, Per-token KG Integration, & Supervised Classification
[47,48] General domain NLP; ERNIE 2.0 & 3.0 augment traditional MLM with entity-specific masking (e.g., masking the word "Mozart" from the sentence "Mozart was a musician") as well as a multi-task per-sample task, largely aimed at classifying a block of text based on internal text cohesion (predict the true order of the sentences within an input sample & identify whether the sentences within the input sample are spatial neighbors, come from the same document, or come from different documents). ERNIE 3.0 additionally augments pretraining with a per-token relation-embedding task using cloze-filling as a vehicle to perform relation extraction on pre-specified per-token KGs.
[36] General domain NLP; ERNIE (no relation to [13,47]) uses both architectural and objective-function changes to inject per-token knowledge into PT. Specifically, they separately embed all named entities in a sample, using the architecture to join contextualized entity embeddings with the embeddings of the tokens realizing that entity in the span, and perform entity-specific masking. In addition, they simultaneously perform standard MLM and next-sentence prediction in the manner of BERT [35].
[31] General domain NLP; MG-BERT introduces a GCNN layer after BERT token, aggregating token embeddings together over a unified graph consisting both of co-occurrence relationships and knowledge graph relationships.
[23] General domain NLP; JAKET embeds entities by extracting per-token representations of entity texts inside per-entity descriptions, then produces updated KG embeddings via a graph attention network [109]. Those embeddings are then fed into a language model alongside per-token embeddings corresponding to those entities. The entire model is trained according to an MLM objective, plus entity category prediction and relation prediction (only on the entity embeddings extracted from entity descriptions and fed through the GCNN, not on the raw entities within the contextualized text).
[21] Biomedical NLP; BERT-MK introduces a transformer-based subgraph summarization network that produces entity embeddings for dynamically chosen subgraphs of a given knowledge graph. This network is trained via a contrastive triplet-validity objective. These are then fused with per-token embeddings in free-text based on a priori entity-token matching (i.e., named entity recognition and linking must be performed first and separately before using this model).
[37] General domain NLP; Coke is similar to ERNIE [36], JAKET [23], and BERT-MK [21] in that it aggregates entity information by leveraging a GCNN over a restricted dynamic context KG based on token-entity mentions, then integrates those augmented embeddings into the per-token embeddings of a BERT-style pretrained model (similar to JAKET and BERT-MK), but also leverages the denoising entity autoencoder task of ERNIE [36]. In addition, in the variant of Coke derived from the BERT model, Coke also employs the next-sentence prediction task introduced in BERT [35].
[41] Medical domain NLP; SMedBERT leverages a complex, multi-faceted loss including MLM and sentence-order prediction (SOP; as introduced in, e.g., ALBERT [40]), and includes per-token KG information by aggregating token embeddings across KG embeddings (produced via trans-H [110]) corresponding to matching entities and the neighbors of matching entities in the KG. They also include relation and entity masking variations to ensure the PT model learns per-token information corresponding to the KG. This method bears similarity to Coke [37] and JAKET [23]. However, unlike Coke and JAKET, SMedBERT realizes the entity/neighbor matching via a geometric objective, which results in an explicit per-token knowledge graph alignment.
[49] General domain NLP; Dict-BERT focuses on augmenting BERT by concatenating definitions of rare words via a per-token KG integration. They add two additional tasks atop the traditional MLM task. First, a task maximizing the mutual information between a masked rare word (treated as a named entity) and its definition (represented as the per-token embedding of the first mention of the entity in the concatenated definition). Second, a task discriminating valid rare word definition per-sequence embeddings from non rare-word definition embeddings via a classification objective.
[44] Sentiment Analysis; SentiLARE integrates sentiment analysis and labels into pre-training by including word polarity signals during masked language modelling and embedding and augmenting pre-training with a supervised sentence sentiment prediction. Word polarities are determined via an external knowledge base integrated at the per-token level.
[38] Dialogue Modelling; SPIDER augments traditional MLM and NSP pre-training with two tasks specific to dialogue modelling: first, utterance order prediction, in which individual utterances (which are nested within a larger sample) are shuffled and the true order is predicted; and second, a geometric task ensuring that subject, verb, object triples from the utterances obey a geometric relationship inspired by KG embedding methods.

A.5 Language modelling & Graph link-prediction realized as single-task classification
These methods all employ some variant of a graph link-prediction task over their data. However, they all realize this link prediction task not by enforcing any relationship between independent sample embeddings but rather by concatenating samples corresponding to linked (or unlinked, for negative samples) pairs of vertices in the source graph, then framing the learning problem as a binary or multi-class classification problem over the (now concatenated) single output whole sample embedding. In doing so, they transform the task from one that implies a deep geometric constraint over the output latent space to one that only enforces an intra-sample objective and imposes only a shallow geometric constraint on the per-sample latent space.
[35] General domain NLP; Masked language model plus the binary classification of whether the input text block is sequentially consistent, with samples chosen via true positive pairs vs. randomly joined sentences. This can be seen as a link prediction task over a graph consisting of independent, disconnected "sticks", with each stick corresponding to sentences in the documents in the corpus, in sequential order.
[40] General domain NLP; Masked language model plus the binary classification of whether the input text block is sequentially consistent, with samples chosen via true positive pairs vs. reordered positive sentence pairs. This can be seen as a link prediction task over a directed graph consisting of independent, disconnected "sticks", with each stick corresponding to sentences in the documents in the corpus, in sequential order, with edge direction indicating sequential ordering.
[50] General domain NLP; Masked language model plus the classification of whether the input text block contains sentences from either (1) random documents, (2) a sequentially consistent pair within a single document, or (3) within a pair of sentences within two linked documents according to a document linking graph G. This can be seen as a link prediction/edge classification task over a graph whose nodes are text blocks in the corpus, with two distinct edge modalities. First, to capture sequential consistency within a document, one edge type produces a set of independent, disconnected "sticks", with each stick corresponding to sentences in the documents in the corpus, in sequential order. Second, to capture the document linking graph G, sentences in a document D i are all linked to all sentences in a document D j if and only if documents i and j are linked in G.
[39] General domain NLP; While this model incorporates an interesting per-token syntactic knowledge distillation procedure, at a per-sample level it merely leverages BERT's NSP loss [35].
in the model. Note that JAKET [23] also leverages entity descriptions in its per-token encoding. However, these descriptions are (1) extracted via per-token embeddings, using the first mention of the token, not whole-sample embeddings, and (2) integrated back into the original text in a per-token manner, not optimized over directly via geometric constraints as in KEPLER.
[69] Molecules; CK-GNN designs a pre-training scheme for molecular graphs in which a molecular GNN is trained to produce molecule embeddings that obey the similarity structure of a 1-NN graph in a cluster-limited molecular fingerprint space (using the Dice similarity coefficient). Unlike the NLP approaches, this method has no intra-sample (i.e., per-token, where here "token" refers to individual atoms within the molecular graph) pre-training task.
[70] Multi-lingual NLP; Much like KEPLER, XLM-K augments traditional MLM with two tasks that constrain the geometry of the per-sample latent space via a (now multi-lingual) graph of entity descriptions linked to sentences containing said entities. Like KEPLER, as the graph connections here are defined only for entity descriptions and not all free-text, the latent space regularization is only over a limited slice of the space.
[71] General domain NLP/IR; WebFormer designs a pre-training scheme leveraging the structure of DOM trees in HTML pages to impose multiple per-sample and per-sample/per-token hybrid constraints that encourage individual samples to be (a) close to noised versions of themselves based on reordering or masking and (b) close to representations of their parent/child nodes in the DOM tree, thus imposing a structural penalty geometrically. By mixing per-sample and per-token tasks, WebFormer even more closely entangles the per-sample and per-token latent spaces in their model, and this approach bears closer study in other contexts.
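To make the fingerprint-space neighbour graph described for CK-GNN [69] above more concrete, the following is a minimal sketch of computing Dice similarities between binary fingerprints and deriving a 1-NN graph from them. It is an illustration only and not the authors' implementation: the names `dice_similarity`, `one_nn_graph`, and the toy fingerprints are hypothetical, fingerprints are assumed to be given as sets of "on" bit indices, and the cluster-limiting step mentioned above is omitted.

```python
def dice_similarity(a: frozenset, b: frozenset) -> float:
    """Dice coefficient between two sets of 'on' fingerprint bit indices."""
    if not a and not b:
        return 1.0  # convention assumed here for two empty fingerprints
    return 2 * len(a & b) / (len(a) + len(b))


def one_nn_graph(fingerprints: dict) -> dict:
    """Map each molecule to its most similar other molecule.

    Ties are broken in favour of the first candidate encountered.
    """
    edges = {}
    for name, fp in fingerprints.items():
        best, best_sim = None, -1.0
        for other, other_fp in fingerprints.items():
            if other == name:
                continue
            sim = dice_similarity(fp, other_fp)
            if sim > best_sim:
                best, best_sim = other, sim
        edges[name] = best
    return edges


# Hypothetical toy fingerprints (not real molecules).
toy = {
    "mol_a": frozenset({1, 2, 3, 4}),
    "mol_b": frozenset({1, 2, 3, 5}),
    "mol_c": frozenset({7, 8, 9}),
    "mol_d": frozenset({7, 8, 10}),
}
graph = one_nn_graph(toy)
```

In a scheme of this kind, the edges of such a graph would supply the pairs of molecules whose embeddings are encouraged to be close during pre-training.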
A.9 Language modelling & whole-sample augmentation/noising based contrastive objectives
[60] General domain NLP; InfoWord incorporates an objective alongside masked language modelling which pushes the whole-sample embedding of a sentence to have high mutual information with various sub-contexts within that sentence and low mutual information with sub-contexts of other sentences.
[56] General domain NLP; DeCLUTR optimizes for masked language modelling alongside a contrastive objective comparing anchor spans to positive spans chosen from within individual samples, contrasted against spans from other samples. This is considered "whole-sample" rather than a per-token contrastive loss as the embeddings of the spans (which can be quite long) are produced via a canonicalized pooling operation used for sentence embeddings.
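The contrastive objectives in this subsection share a common shape: an anchor embedding is pulled towards a positive drawn from the same sample and pushed away from embeddings of other samples. One widely used form of such an objective is the InfoNCE loss, sketched below in plain Python. This is an assumed, generic formulation for illustration; the exact losses, pooling operations, and temperatures used by the cited methods may differ, and the function names and toy vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    return dot / (norm_u * norm_v)


def info_nce(anchor, positive, negatives, temperature=0.1):
    """Negative log-probability of the positive among positive + negatives."""
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, n) / temperature for n in negatives]
    m = max(logits)  # subtract the max for numerical stability
    log_sum = math.log(sum(math.exp(l - m) for l in logits))
    return log_sum + (m - logits[0])


# Toy whole-sample embeddings (hypothetical values).
anchor = [1.0, 0.0]
aligned = info_nce(anchor, [0.9, 0.1], [[0.0, 1.0], [-1.0, 0.0]])
misaligned = info_nce(anchor, [0.0, 1.0], [[0.9, 0.1], [-1.0, 0.0]])
```

The loss is small when the positive is the nearest candidate to the anchor (`aligned`) and large when a negative is nearer (`misaligned`), which is the geometric pressure these objectives place on the whole-sample latent space.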

A.10 Language modelling & multi-modal or multi-lingual contrastive objectives
Note that by viewing multiple data modalities as "augmentations" of the data samples, one can realize these methods (in general) as examples of augmentation-based contrastive learning objectives, such as those used in [92]. However, as these methods are common, we highlight them explicitly here.
[65] General domain NLP; InfoXLM focuses on multi-lingual pre-training, and leverages per-token tasks. This includes multi-lingual masked language modelling and translation language modelling (i.e., variations on a traditional masked language modelling task). It also incorporates a cross-lingual per-sample contrastive objective that aligns the geometry of the latent spaces across distinct languages. One important nuance is that they use different layer depths to define the latent space for their cross-lingual contrastive objective vs. their per-token objectives, which is not natively describable in our framework. In addition, as each monolingual corpus lacks any rich, independent per-sample task, any individual monolingual latent space cannot be guaranteed to have any rich structural constraints.

A.11 Language modelling alone with relationally-concatenated samples
These methods concatenate samples together based on inter-sample relations before processing them with a pre-training encoder. This is a direction for adding greater per-sample dependencies to pre-training methods that is orthogonal to our framework, but it warrants commentary nonetheless.
[19] Protein sequences; MSA transformers extend protein-sequence language models such that they do not take in as input a single sequence but rather an entire multiple-sequence alignment (MSA) profile. These profiles consist of many sequences corresponding to evolutionary homologs of the same protein. This concatenated input is processed via a sparsified form of axial self-attention, which enables cross-attention between the various aligned sequences. They impose no per-sequence tasks by default in this architecture.
[29] General domain NLP; This theoretical analysis shows that transformers cannot model dependencies between sentences that never appear in the same example during pre-training. To combat this, they propose concatenating samples via inter-sample relations (in particular, via a kNN method) at pre-training time, enabling a greater diversity of cross-attention contexts during pre-training vs. fine-tuning. Thus, while they only use language modelling during pre-training, they speculate that their sample-augmentation procedure helps the model better reason about per-sample information through per-token tasks.
[27] General domain NLP; CDLM proposes to concatenate multiple related documents (leveraging categorical information to cluster documents) together into a single sample prior to performing traditional masked language modelling. To limit the model's complexity, attention is restricted to intra-document for unmasked tokens but allowed to be global for masked tokens.
[53] General domain NLP; REALM uses a latent variable model to learn a relevance score between input text spans and documents in an auxiliary document base. The top-k documents, according to this relevance score, are then concatenated to the input prior to solving the masked language modelling task used during pre-training. In this way, the model learns to join relevant documents from an external knowledge base in accordance with which documents would most improve the masked language modelling objective. In addition, by learning this relevance score, the model introduces an implicit whole-sample structural constraint on the latent space according to the unsupervised clustering induced by relevance assignment.
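The attention restriction described for CDLM [27] above can be illustrated with a small mask-building sketch. This reflects one reading of that description and is not taken from the CDLM implementation: the function `build_attention_mask` and its arguments are hypothetical, and the `symmetric` flag is an assumption added to cover the alternative reading in which every position may also attend to masked positions.

```python
def build_attention_mask(doc_ids, is_masked, symmetric=False):
    """Return allowed[q][k]: may query position q attend to key position k?

    doc_ids[i]   -- index of the document that token i came from
    is_masked[i] -- True if token i is masked for language modelling
    symmetric    -- if True, every position may also attend to masked tokens
    """
    n = len(doc_ids)
    if len(is_masked) != n:
        raise ValueError("doc_ids and is_masked must have the same length")
    return [
        [
            is_masked[q]                      # masked queries attend globally
            or doc_ids[q] == doc_ids[k]       # others stay within their document
            or (symmetric and is_masked[k])   # optional: masked keys visible to all
            for k in range(n)
        ]
        for q in range(n)
    ]


# Two concatenated toy documents (3 tokens + 2 tokens); token 1 is masked.
doc_ids = [0, 0, 0, 1, 1]
is_masked = [False, True, False, False, False]
mask = build_attention_mask(doc_ids, is_masked)
mask_sym = build_attention_mask(doc_ids, is_masked, symmetric=True)
```

Under this reading, only the masked position can draw on context from the related document, which keeps the number of allowed attention pairs close to that of processing each document separately.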

A.12 Autoencoding & Unsupervised Clustering
[52] General domain NLP; MARGE deviates significantly from the norm by not employing any form of language modelling or other forms of a per-token pre-training task. Instead, it employs only a per-sample contextualized autoencoding objective and an unsupervised per-sample retrieval step (to provide context for said autoencoding). While this approach does provide a deeper form of a per-sample structural constraint than many other approaches, it is also implicit and has no mechanism for injecting domain knowledge. MARGE is also tested solely on downstream tasks at the per-sample level, so it is unclear if this method would offer reduced benefits for per-token downstream tasks.

A.13 Methods orthogonal to our framework
[112] KG-BART is a text-generation model that leverages per-token knowledge after a text encoder to enrich the generated text with information from a textual knowledge graph (in a per-token manner). It is neither used for general pre-training nor does it leverage any additional per-sample constraints.
[113] Text-based Knowledge Graphs; This work produces embeddings of nodes in KGs by combining transformer-based text encodings with graph convolutional network KG embedding methods, leveraging link prediction as the pre-training task. Entity descriptions / textual features represent the individual nodes. Link prediction can be seen as inducing a geometric constraint via the connectivity of the knowledge graph on whole-sample embeddings. However, given that relationships are used in encoding the data as well, GraphFormer cannot be used in a context where KG links may not be observed at FT time. It should be seen not as a general text PT method but as an advanced KG embedding mechanism, so it does not directly fall under our framework.
[114] KeLM (unrelated to KELM [115]) is a method for converting a free-text KG into textual nodes so language modelling can be used over that corpus and is orthogonal to the methods of pre-training.
[79] This paper is a method for populating a KG from free-text via BERT. It has no bearing on incorporating structure or knowledge into PT and is irrelevant to our framework.
[116] This paper presents a method to drop redundant triples from a knowledge graph and a regularization technique to limit the impact of added irrelevant knowledge to per-token knowledge-enhanced PT methods such as ERNIE [36].
[117] Knowledge Graph Completion; KG-BERT is a method for knowledge graph completion in which textual representations of entities and relations in KGs are embedded by fine-tuning a pre-trained BERT-style transformer for link prediction over a given KG. As this is only for knowledge graph completion, it is orthogonal to our study of pre-trained models in general.
[118] Knowledge Graph Completion; Much like KG-BERT, SimKGC is a method for knowledge graph completion that fine-tunes a BERT model via a contrastive loss over a fixed knowledge graph for link prediction. Though their methodology overlaps with ours in that both use variants of contrastive losses and SimKGC explores more complex negative sampling strategies, the two methods are still very different. Ours is focused on general pre-training and uses a single encoder and a unified latent space. In contrast, SimKGC is only examined for KG completion and encodes head and tail entities via separate encoders.
[119] Event Extraction (EE); CLEVE designs a pre-training method specifically for event extraction. Their pre-training method includes a text-encoder which includes a cross-event contrastive loss pushing individual tokens from the same "event" closer together than those from different events, which bears a surface similarity to our approach. In addition, they add a graph encoder over the semantic structure of events. Their methodology is focused solely on EE, which is orthogonal to our more general PT framework.
[120] General domain NLP and Computer Vision; ViLT is a method for pre-training aligned text-image pairs. It leverages masked language modelling, an image-text matching binary classification objective, and a contrastive objective comparing image and text representations. This multi-modal contrastive objective is very similar (insofar as it relates to our framework) to those works that perform multi-lingual or other multi-modal contrastive methods. In ViLT, however, the transformer architecture processes images and text jointly in a single encoder, node embeddings will rely on connectivity information, which is not permissible in our pre-training context. So, this method is orthogonal to our study here.
[127] Expert Matching; CODE is a method specifically and exclusively designed to discover appropriate experts in an employment/contracting setting and is thus orthogonal to our framework, which is focused on more general pre-training.

A.14 Methods that only change things at FT time
[128] Biomedical domain NLP; MOP does not change anything at PT time but trains sub-KG adapters on entity recognition tasks prior to FT to infuse entity knowledge into the PT method. It is a per-token pre-training method.
[129] General domain NLP; K-BERT is equivalent to BERT [35] at PT time. At FT time, however, it augments the sentence flow with injected per-token knowledge graphs and limits self-attention to flow only along links supported by the original sentence or the injected knowledge. As these changes apply only at FT time, its pre-training remains that of BERT.
[130] General domain NLP; Like [129], this model is equivalent to BERT [35] at PT time. However, it specializes in a fine-tuning procedure for sentence information retrieval tasks, similar to how PT is adapted in this framework.
[131] General domain NLP; ConSERT adds an auxiliary specialization stage after pre-training to fine-tune sentence representations. This new stage imposes a SimCLR [132] style data-augmentation/noise-invariance based contrastive learning objective, using adversarial perturbations, token shuffling, token/feature/span erasure, and dropout noising methods.
[133] General domain NLP; IS-BERT does not modify anything from traditional BERT at pre-training time. However, they add a second PT stage to optimize sentence representations alone using an auxiliary feature extractor in the form of various CNNs applied atop BERT token representations. The final sentence representation is trained to have high mutual information with various sub-contexts within the sentence but low mutual information with other sentences. In this second pre-training stage, there is no language modelling performed. As this approach only adapts an auxiliary featurizer to produce sentence encodings and is not intended for general transfer learning, it is inappropriate for our framework. A similar work that integrates both components during pre-training, and is thus relevant to our work, is [60], discussed above.
[115] General domain NLP; KELM does not modify the PT objective but instead enhances a model at FT time by injecting per-token knowledge via a GNN module atop the pre-trained LM.
[140] Language modelling; kNN language models improve the text generation powers of language models by augmenting traditional decoding with a nearest-neighbor lookup operation over a text datastore, using the language model's embeddings of a token's leftward context to judge nearest neighbors. However, this involves no additional language model training and can only be applied at fine-tuning time to aid in text generation, and is thus out of our scope.
[141] Sentence embedding; NT-Xent proposes a secondary specialization stage after pre-training only for generating sentence embeddings. To do this, they employ a contrastive objective contrasting the final CLS embeddings of an updating, specialized BERT model against a pooled aggregate of the per-token embeddings across all layers of the pre-trained BERT model used to initialize the specialized sentence embedding model.
[73,142,143] Sentence Embedding; These methods propose to use unsupervised per-sample smoothing operations (a normalizing flow network in [73] and a mean/covariance standardization whitening operation in [142,143]) on the per-sample embeddings after pre-training in order to produce higher quality per-sample embeddings.
[92] General domain NLP; SimCSE extends traditional MLM by imposing a second pre-training stage for optimizing sentence embeddings. In this stage, SimCSE optimizes the transformer such that the whole-sample embeddings satisfy either a supervised or unsupervised contrastive learning objective. In the supervised case, this is based on labeled sentence pairs according to a Natural Language Inference (NLI) task, with entailment pairs being treated as positives and contradiction pairs as hard negatives. In the unsupervised case, this is based solely on applying multiple dropout masks to the same sentence to generate positive pairs. Any two distinct sentence inputs are treated as negative samples. This extra pre-training stage is applied to a relatively small number of samples (10^6) relative to the entire PT cohort, which may help prevent catastrophic forgetting of the original pre-training objective.
[144] Academic NLP; SPECTER extends traditional language model pre-training by imposing a second pre-training stage for optimizing document embeddings (realized as [CLS] token embeddings of concatenated academic paper titles and abstracts). This stage uses a triplet-based geometric loss to ensure that these per-sample embeddings reflect the structure of a pre-specified citation network. This is a form of an explicit, structural constraint; however, they do not ever test fully fine-tuning the SPECTER model in their paper and only compare it against other, frozen pre-trained language models. This is likely to have a significant impact on model comparisons. Similar to SimCSE [92], this extra pre-training stage is applied to a small number of samples (146K documents) to help prevent catastrophic forgetting of the original pre-training objective.
[125] General domain NLP; This paper introduces a second pre-training stage after multi-lingual masked language modelling. In this second stage, hyperlinks in the source text (drawn from Wikipedia) are matched via single-task classification to a curated set of destination URL categories, collapsing all URLs pointing to the same Wikipedia page across languages into one. They do this classification in several ways, including incorporating the per-sample representation of the text span rather than merely the hyperlink token representations themselves (likely motivated by the likelihood of only a single hyperlink being present in the source text). We can realize this task as instances of several other common paradigms: (1) single-task classification applied to the per-sample representation, (2) link prediction in a graph linking cross-lingual Wikipedia pages together, or (3) an example of named entity recognition. This second stage is only allowed to modify the last two layers of the transformer architecture, which may be a vehicle to prevent catastrophic forgetting.
[145] Sentiment Analysis; SAKG-BERT augments a pre-trained language model with a sentiment-analysis knowledge graph at fine-tuning time only, by concatenating relevant relationships from the KG (selected based on sentiment-laden terms appearing in the review) to the raw input text. They do not otherwise change the pre-training or fine-tuning process.
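As an illustration of the kind of triplet-based geometric loss described for SPECTER [144] above, the sketch below computes a standard triplet margin loss over per-sample embeddings. It is a generic formulation under assumed choices (Euclidean distance, a margin of 1.0, toy two-dimensional embeddings) and should not be read as SPECTER's exact loss or hyperparameters; the function names are hypothetical.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


def triplet_margin_loss(query, positive, negative, margin=1.0):
    """Hinge loss that is zero once the positive is closer than the
    negative by at least `margin`."""
    return max(0.0, euclidean(query, positive) - euclidean(query, negative) + margin)


# Toy document embeddings (hypothetical values): `cited` plays the role of a
# paper linked to `query` in the citation network, `uncited` one that is not.
query = [0.0, 0.0]
cited = [0.0, 1.0]
uncited = [3.0, 4.0]

satisfied = triplet_margin_loss(query, cited, uncited)  # constraint already met
violated = triplet_margin_loss(query, uncited, cited)   # roles swapped
```

A loss of this form only penalizes triplets whose relative distances disagree with the graph, so it constrains the ordering of distances in the latent space rather than their absolute values.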