Surrogate- and invariance-boosted contrastive learning for data-scarce applications in science

Deep learning techniques have been increasingly applied to the natural sciences, e.g., for property prediction and optimization or materials discovery. A fundamental ingredient of such approaches is the vast quantity of labeled data needed to train the model. This poses severe challenges in data-scarce settings where obtaining labels requires substantial computational or labor resources. Noting that problems in natural sciences often benefit from easily obtainable auxiliary information sources, we introduce surrogate- and invariance-boosted contrastive learning (SIB-CL), a deep learning framework which incorporates three inexpensive and easily obtainable auxiliary information sources to overcome data scarcity. Specifically, these are: abundant unlabeled data, prior knowledge of symmetries or invariances, and surrogate data obtained at near-zero cost. We demonstrate SIB-CL's effectiveness and generality on various scientific problems, e.g., predicting the density-of-states of 2D photonic crystals and solving the 3D time-independent Schrödinger equation. SIB-CL consistently yields orders-of-magnitude reductions in the number of labels needed to achieve the same network accuracies.


I. INTRODUCTION
In recent years, there has been increasing interest and rapid advances in applying data-driven approaches, in particular deep learning via neural networks, to problems in the natural sciences [1][2][3][4]. Unlike traditional physics-informed approaches, deep learning relies on extensive amounts of data to quantitatively discover hidden patterns and correlations to perform tasks such as predictive modelling [4,5], property optimization [6,7] and knowledge discovery [8,9]. Its success is thus largely contingent on the amount of data available, and a lack of sufficient data can severely impair model accuracy. Historically, deep learning applications have overcome this by brute force, e.g., by assembling vast curated data sets by crowd-sourced annotation or from historical records. Prominent examples include ImageNet [10] and CIFAR [11] in computer vision (CV) and WordNet [12] in natural language processing (NLP) applications. The majority of problems in the natural sciences, however, are far less amenable to this brute-force approach, partly reflecting a comparative lack of historical data, and partly the comparatively high resource cost (e.g., time or labor) of synthesizing new experimental or computational data.
Recently, to alleviate the reliance on labeled data, techniques like transfer learning and self-supervised learning have emerged. Transfer learning (TL) [13][14][15][16] refers to the popular approach of fine-tuning a neural network, which has been pre-trained on a large labeled source dataset, for a target task. TL is widely adopted in CV and NLP applications; prominently, models pre-trained on ImageNet [10] such as VGGNet [17] and ResNet [18] are used for fine-tuning on a variety of CV tasks, while powerful language models like BERT [19] and GPT3 [20] pre-trained on large-scale corpora have been used for fine-tuning on a variety of downstream NLP tasks. Within the natural sciences, TL has also been explored and proven effective [21][22][23][24]. Most works, however, make use of source data from a different problem [21,22] or field [23,24], e.g., using models pre-trained on ImageNet [10] to accelerate training of material prediction tasks; this limits the efficacy of TL due to the dissimilarity between the source and target problems [25,26].

* cloh@mit.edu
Extending from TL, self-supervised learning (SSL) [27,28] is a technique where the pre-training stage uses only unlabeled data. Specifically, "pretext tasks" like image rotation prediction [29] and jigsaw puzzle solving [30] are invented for the data to provide its own supervision. In particular, contrastive SSL [28] (or contrastive learning) is an increasingly popular technique where the pretext task is constructed as contrasting two variations of the same sample against other samples, with variations derived using image transformations. The goal is for the pre-trained model to output embeddings where similar (differing) instances are closer (further apart) in the embedding metric space; this has proven highly effective for CV downstream tasks like classification, segmentation and detection. Recent contrastive learning techniques [31][32][33][34][35] have achieved unprecedented successes in CV; yet, there have been few applications to the natural sciences [36,37], partly owing to the intricacy of designing transformation strategies [38] suitable for scientific problems.
Here, we introduce Surrogate- and Invariance-Boosted Contrastive Learning (SIB-CL), a deep learning framework using self-supervised and transfer learning techniques to incorporate various sources of "auxiliary information" (see Fig. 1). Unlike dominant deep learning fields like CV and NLP, problems in the natural sciences often benefit from a rich and deep array of codified and formal insights and techniques. Prominent among these are exact and approximate analytical techniques or general insights requiring minimal or no computational cost. In SIB-CL, we consider auxiliary information sources that are either accessible a priori or can be obtained by inexpensive means. Specifically, these are: 1) abundant unlabeled data; 2) prior knowledge in the form of invariances of the physical problem, which can be governed by geometric symmetries of the inputs or general non-symmetry-related invariances of the problem; 3) a surrogate dataset on a similar problem that is cheaper to generate, e.g., by invoking simplifications or approximations to the labelling process. We show that the SSL and TL techniques used in SIB-CL provide effective and broadly applicable strategies to incorporate these auxiliary information sources, enabling effective and high-quality network training despite data scarcity. Here, its effectiveness will be demonstrated on various problems in the natural sciences, in particular, on two systems in the fields of photonics and quantum calculations.

arXiv:2110.08406v1 [cs.LG] 15 Oct 2021

Figure 1. Overcoming data scarcity with SIB-CL. We propose to overcome data scarcity by leveraging a) an abundance of unlabeled data, b) prior knowledge of the underlying physics (e.g., symmetries and invariances of the data) and c) knowledge from a possibly approximate surrogate dataset which is faster and cheaper to generate (e.g., coarse-grained computations or special-case analytical solutions). d) SIB-CL incorporates these auxiliary information sources into a single framework to accelerate training in data-scarce settings. Examples show unit cells of square 2D photonic crystals (see also Fig. 3).

II. RESULTS
A. Surrogate- and Invariance-Boosted Contrastive Learning (SIB-CL)

We seek to train a neural network to predict desired properties (or "labels") y from inputs x using minimal training data. More precisely, for a target problem $D_t = \{x_i, y_i\}_{i=1}^{N_t}$ consisting of $N_t$ input-label pairs, we focus on problem spaces where $N_t$ is too small to successfully train the associated network. To overcome this, we introduce two auxiliary data sets: (1) a set of zero-cost unlabeled inputs $D_u = \{x_i\}_{i=1}^{N_u}$ and (2) a surrogate data set $D_s = \{\tilde{x}_i, \tilde{y}_i\}_{i=1}^{N_s}$ consisting of inexpensively computed labels $\tilde{y}_i$ (e.g., from approximate or semi-analytical models) with associated inputs $\tilde{x}_i$ (possibly, but not necessarily, a "simple" subset of all inputs). The quantity of each of these auxiliary data sets is assumed to far exceed that of the target problem, i.e., $\{N_u, N_s\} \gg N_t$ (and, typically, $N_u > N_s$).
On the basis of these auxiliary datasets, we introduce a new framework, Surrogate- and Invariance-Boosted Contrastive Learning (SIB-CL), that significantly reduces the data requirements on $D_t$ (Fig. 2). SIB-CL achieves this by splitting the training process into two stages: a first, interleaved two-step pre-training stage using the auxiliary data sets $D_u$ and $D_s$ (Fig. 2a,b), followed by a fine-tuning stage using the target data set $D_t$ (Fig. 2c).
In the first step of the pre-training stage (Fig. 2a), we exploit contrastive learning to learn invariances in the problem space using unlabeled inputs aggregated from the unlabeled and surrogate data sets, $D_{CL}$. We complement $D_{CL}$ by a set of known, physics-informed invariance relations $\{g\}$ (which we formally associate with elements of a group G) which map input-label pairs $(x_i, y_i)$ to $(gx_i, y_i)$, i.e., to new inputs with identical labels. We base this step on the SimCLR technique [32], though we also explore using the BYOL technique [33] later (see Discussion and SI Section S1). Specifically, for each input $x_i$ in $D_{CL}$ (sampled in batches of size B), two derived variations $gx_i$ and $g'x_i$ are created by sampling two concrete mappings $g$ and $g'$ from the group of invariance relations G. The resultant 2B inputs are then fed into encoder and projector networks, H and J respectively, producing metric embeddings $z_i^{(\prime)} = J(H(g^{(\prime)}x_i))$. A positive pair $\{z_i, z_i'\}$ is the pair of metric embeddings derived from the two variations of $x_i$, i.e., $gx_i$ and $g'x_i$; all other pairings in the batch are considered negative. At each training step, the weights of H and J are simultaneously updated according to a contrastive loss function defined by the normalized temperature-scaled cross entropy (NT-Xent) loss [32]:

$$\ell_i = -\log \frac{\exp(s_{ii'}/\tau)}{\sum_{j=1}^{2B} [j \neq i]\, \exp(s_{ij}/\tau)}, \qquad (1)$$

where $s_{ii'} = \hat{z}_i \cdot \hat{z}_{i'}$ (with $\hat{z}_i = z_i/\|z_i\|$) denotes the cosine similarity between two normalized metric embeddings $\hat{z}_i$ and $\hat{z}_{i'}$, $[j \neq i]$ uses the Iverson bracket notation, i.e., evaluates to 1 if $j \neq i$ and 0 otherwise, and τ is a temperature hyperparameter (fixed at 0.1 here). The total loss is taken as the sum across all positive pairs in the batch. In our batch sampling of $D_{CL}$, we sample one-third of each batch from $D_s$ and two-thirds from $D_u$. Conceptually, the NT-Xent loss acts to minimize the distance between embeddings of positive pairs (numerator of Eq. (1)) while maximizing the distances between embeddings of negative pairs in the batch (denominator of Eq. (1)). Consequently, we obtain representations $H(x_i)$ that respect the underlying invariances of the problem.

Each epoch of contrastive learning (i.e., each full sampling of $D_{CL}$) is followed by a supervised learning step, the second step of the pre-training stage (Fig. 2b), on the surrogate dataset $D_s$, with each input from $D_s$ subjected to a random invariance mapping. This supervised step shares the encoder network H with the contrastive step but additionally features a predictor network G, both updated via a task-dependent supervised training loss function (which will be separately detailed later). This step pre-conditions the weights of G and further tunes the weights of H to suit the target task.
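As a concrete reference, the NT-Xent loss can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation; in particular, the pairing convention (consecutive rows holding the two augmented views of a sample) is an assumption:

```python
import numpy as np

def nt_xent_loss(z, tau=0.1):
    """NT-Xent loss over a batch of 2B metric embeddings z (shape (2B, d)),
    where rows 2k and 2k+1 hold the two augmented views of sample k.
    Minimal NumPy sketch; the pairing convention is an assumption.
    """
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # normalize: s_ij = cosine similarity
    exp_s = np.exp(z @ z.T / tau)                      # exp(s_ij / tau)
    np.fill_diagonal(exp_s, 0.0)                       # Iverson bracket [j != i]
    losses = []
    for i in range(z.shape[0]):
        j = i + 1 if i % 2 == 0 else i - 1             # index of i's positive partner
        losses.append(-np.log(exp_s[i, j] / exp_s[i].sum()))
    return float(np.mean(losses))
```

Embeddings whose positive pairs coincide (and are far from all other samples) drive the loss toward zero, while mismatched pairs inflate it, which is exactly the attract/repel behavior described above.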
The pre-training stage is performed for 100–400 epochs and is followed by the fine-tuning stage (Fig. 2c). This final stage uses $D_t$ to fine-tune the networks H and G to the actual problem task, crucially, with significantly reduced data requirements on $D_t$. Each input from $D_t$ is also subjected to a random invariance mapping; the associated supervised training loss function is again problem-dependent and may even differ from that used in the pre-training stage.

Figure 2. Surrogate- and Invariance-Boosted Contrastive Learning (SIB-CL) framework. Network training proceeds via a pre-training stage (a,b) followed by a fine-tuning stage (c). a,b) The pre-training stage alternates a contrastive learning step (a) using unlabeled data $D_{CL}$ with a supervised learning step (b) using surrogate data $D_s$. Contrastive learning encourages representations that respect the underlying invariances of the problem, and supervised learning on the surrogate dataset attunes both representations and a predictor network to the desired prediction task. c) After 100–400 epochs of pre-training, the encoder and predictor weights are copied and subsequently fine-tuned by supervised learning on the target dataset $D_t$.
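The interleaved two-stage schedule described above can be sketched framework-agnostically. The callback interface below is purely illustrative and assumes the callbacks update shared encoder/projector/predictor weights:

```python
def sib_cl_schedule(n_pretrain, contrastive_step, surrogate_step,
                    finetune_step, n_finetune):
    """Minimal, framework-agnostic sketch of the SIB-CL schedule: each
    pre-training epoch interleaves a contrastive step with a supervised
    step on the surrogate set D_s; fine-tuning on the target set D_t
    follows. The callback interface is an illustrative assumption.
    """
    log = []
    for epoch in range(n_pretrain):
        contrastive_step()   # step 1: NT-Xent on unlabeled inputs D_CL
        surrogate_step()     # step 2: supervised loss on D_s
        log.append(("pretrain", epoch))
    for epoch in range(n_finetune):
        finetune_step()      # fine-tune encoder H and predictor G on D_t
        log.append(("finetune", epoch))
    return log
```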
In the following sections, we evaluate the effectiveness of SIB-CL on two problems: predicting the density-of-states (DOS) of two-dimensional (2D) photonic crystals (PhCs) and predicting the ground state energy of the three-dimensional (3D) non-interacting Schrödinger equation (see SI for additional experiments, including predictions of 2D PhC band structures). We benchmark our results against two baselines:

1. Direct supervised learning (SL): randomly initialized networks H and G are trained using supervised learning on only the target dataset $D_t$. This reflects the performance of conventional supervised learning, i.e., without exploiting auxiliary data sources.
2. Conventional transfer learning (TL): networks H and G are first pre-trained using supervised learning on the surrogate dataset D s and then subsequently fine-tuned on D t . This reflects the performance of conventional TL-boosted supervised learning, albeit on an unusually well-aligned transfer task.
These baselines do not exploit prior knowledge of invariances; in order to more critically evaluate SIB-CL, we also considered augmented versions of these baselines that incorporate invariance information via a simple data augmentation approach (see SI Section S4), i.e. each input is subjected to a transformation randomly sampled from {g} each time before it is fed into the encoder network H. We abbreviate them correspondingly as SL-I and TL-I.

B. 2D Photonic Crystals
Photonic crystals (PhCs) are wavelength-scale, periodically structured materials whose dielectric profiles are engineered to create exotic optical properties not found in bulk materials, such as photonic band gaps and negative refractive indices, with wide-ranging applications in photonics [39,40]. Prominent among these applications is the ability of PhCs to engineer the strength of light-matter interactions [40] or, equivalently, the density of states (DOS) of photonic modes. The DOS captures the number of modes accessible in a spectral range, i.e., the number of modes accessible to a spectrally narrow emitter, directly affecting e.g. spontaneous and stimulated emission rates. However, computing the DOS is expensive: it requires dense integration across the full Brillouin zone (BZ) of the PhC and summation over bands. Below, we demonstrate that SIB-CL enables effective training of a neural network for prediction of the DOS in 2D PhCs, using only hundreds to thousands of target samples, dramatically reducing DOS-computation costs. Such neural networks could help to accelerate the design of PhC features, either directly via backpropagation [41] or by offering cheap evaluations for multiple invocations of the model, replacing conventional design techniques like topology optimization [42] and inverse design [43].
Dataset generation. PhCs are characterised by a periodically varying permittivity, ε(r), whose tiling makes up the PhC's structure. For simplicity, we consider 2D square lattices of periodicity a with a "two-tone" permittivity profile, i.e., $\varepsilon \in \{\varepsilon_1, \varepsilon_2\}$, with $\varepsilon_i \in [1, 20]$. We assume lossless isotropic materials so that ε(r) and the resultant eigenfrequencies are real. We use a level set of a Fourier sum function (see Methods for details) to parameterize ε(r), discretized into a 32 × 32 pixel image, which forms the input to the neural network. Special care was taken in the sampling algorithm to create diverse unit cells with features of varying sizes, with examples depicted in Fig. 3a.
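A minimal level-set sampler in this spirit could look as follows. This is illustrative only: the number of Fourier terms, the wavevector range, and the permittivity sampling are assumptions, not the exact scheme from the paper's Methods:

```python
import numpy as np

def two_tone_cell(rng, n_pix=32, n_fourier=3, eps_range=(1.0, 20.0)):
    """Sample a two-tone permittivity unit cell from the level set of a
    random low-order Fourier sum. Illustrative sketch: term count,
    wavevector range, and sampling scheme are assumptions.
    """
    x = np.arange(n_pix) / n_pix
    X, Y = np.meshgrid(x, x, indexing="ij")
    phi = np.zeros((n_pix, n_pix))
    for _ in range(n_fourier):
        kx, ky = rng.integers(-2, 3, size=2)        # low-order reciprocal vectors
        amp = rng.normal()
        phase = rng.uniform(0, 2 * np.pi)
        phi += amp * np.cos(2 * np.pi * (kx * X + ky * Y) + phase)
    eps1, eps2 = rng.uniform(*eps_range, size=2)    # the two permittivity tones
    return np.where(phi > 0, eps1, eps2)            # level set at phi = 0
```

Thresholding the smooth field phi at its zero level set guarantees a strictly two-tone, lattice-periodic cell; varying the Fourier content varies the feature sizes.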
We define the DOS of 2D PhCs by [44]

$$\mathrm{DOS}(\omega) = \frac{A}{(2\pi)^2} \sum_n \int_{\mathrm{BZ}} \delta(\omega - \omega_{n\mathbf{k}})\, \mathrm{d}^2\mathbf{k}, \qquad (2)$$

with ω denoting the considered frequency, $\omega_{n\mathbf{k}}$ the PhC band structure, n the band index, $\mathbf{k}$ the momentum in the BZ, and $A = a^2$ the unit cell area. In practice, we evaluate Eq. (2) using the generalized Gilat–Raubenheimer (GGR) method [45], which incorporates the band group velocity extrapolatively to accelerate convergence, in an implementation adapted from Ref. 46. The band structure and associated group velocities are evaluated using the MIT Photonic Bands (MPB) software [47] for the transverse magnetic (TM) polarized bands (Fig. 3b; also see Methods). We define labels for our network using the computed DOS values (Fig. 3c) subjected to three simple post-processing steps (see Methods): (i) spectral smoothing using a narrow Gaussian kernel $S_\Delta$; (ii) shifting by the DOS of the "empty lattice" (i.e., a uniform lattice of index $n_{\mathrm{avg}}$), $\mathrm{DOS}_{\mathrm{EL}}(\omega) = \omega a^2 n_{\mathrm{avg}}^2/2\pi c^2$; and (iii) rescaling both the DOS and frequency values by the natural frequency $\omega_0 = 2\pi c/a n_{\mathrm{avg}}$. More explicitly, we define the network-provided DOS labels as

$$y(\omega) = \omega_0 \big[(S_\Delta * \mathrm{DOS})(\omega) - \mathrm{DOS}_{\mathrm{EL}}(\omega)\big], \qquad (3)$$

and sample over the normalized spectral range $0 \leq \omega/\omega_0 \leq 0.96$.
Step (i) accounts for the finite spectral width of physical measurements and regularizes the logarithmic singularities associated with van Hove points; step (ii) counteracts the linear increase in average DOS that otherwise leads to a bias at higher frequencies, emphasizing instead the local spectral features of the DOS; and step (iii) ensures comparable input and output ranges across distinct unit cells, regardless of the cell's average index.
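The three post-processing steps can be sketched as follows. This is a hedged illustration: the kernel width (in frequency bins) and the a = c = 1 unit conventions are placeholder choices, not the paper's:

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

def dos_labels(omega, dos, n_avg, a=1.0, c=1.0, sigma_bins=3):
    """Turn a raw DOS spectrum into network labels via the three steps:
    (i) Gaussian smoothing, (ii) empty-lattice subtraction, and
    (iii) rescaling by the natural frequency w0 = 2*pi*c/(a*n_avg).
    Kernel width and unit conventions are illustrative choices.
    """
    omega0 = 2 * np.pi * c / (a * n_avg)
    smooth = gaussian_filter1d(dos, sigma_bins)                 # (i) finite spectral width
    dos_el = omega * a**2 * n_avg**2 / (2 * np.pi * c**2)       # (ii) empty-lattice DOS
    return omega0 * (smooth - dos_el)                           # (iii) dimensionless label
```

Feeding in a spectrum equal to the empty-lattice DOS itself yields labels near zero away from the boundaries, confirming that step (ii) removes the linear background.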
In our experiments, we use 20 000 unlabeled unit cells for contrastive learning, select target datasets of varying sizes $N_t \in [50, 3000]$ for fine-tuning, and evaluate the prediction accuracy on a fixed test set containing 2000 samples.
For the surrogate dataset of inexpensive data, $D_s$, we created a simple dataset of 10 000 PhCs with centered circular inclusions of varying radii $r \in (0, a/2]$ and inclusion and cladding permittivities sampled uniformly in $\varepsilon_i \in [1, 20]$. This simple class of 2D PhCs is amenable to semi-analytical treatments, e.g., Korringa–Kohn–Rostoker or multiple-scattering approaches [48][49][50][51], that enable evaluation of the DOS at full precision with minimal computational cost. Motivated by this, we populate the surrogate dataset $D_s$ with such circular inclusions and their associated (exact) DOS labels.

Invariances of PhC DOS. The considered PhCs possess no spatial symmetries beyond periodicity. Despite this, as an intrinsic, global quantity (or, equivalently, a k-integrated quantity), the DOS is setting-independent and invariant under all size-preserving transformations, that is, under all elements of the Euclidean group E(2). For simplicity's sake, we restrict our focus to the elements of E(2) that are compatible with a pixelized unit cell (i.e., that map pixel coordinates to pixel coordinates). This subset is the direct product of the 4mm ($C_{4v}$) point group $G_0$ of the point lattice spanned by $a\hat{\mathbf{x}}$ and $a\hat{\mathbf{y}}$ and the group $G_t$ of pixel-discrete translations. In more detail:

1. Point group symmetry ($G_0$): 4mm includes the identity operation (1), 2- and 4-fold rotations ($C_2$ and $C_4^\pm$), and horizontal, vertical, and diagonal mirrors ($\sigma_h$, $\sigma_v$, and $\sigma_d$). Formally, this is the PhCs' holosymmetric point group.

2. Translation symmetry ($G_t$):
While the DOS is invariant under all continuous translations t, the pixelized unit cells are compatible only with pixel-discrete translations; i.e., we consider the (factor) group $G_t = \{iN^{-1}a\hat{\mathbf{x}} + jN^{-1}a\hat{\mathbf{y}}\}_{i,j=0}^{N-1}$ with N = 32.

Additionally, the structure of the Maxwell equations endows the DOS with two non-Euclidean "scaling" invariances [39]:

1. Refractive scaling ($G_s$): The set of (positive) amplitude-scaling transformations of the refractive index, $g(s)n(\mathbf{r}) = sn(\mathbf{r})$, defines a group $G_s = \{g(s) \mid s \in \mathbb{R}^+\}$ and maps the PhC eigenspectrum from $\omega_{n\mathbf{k}}$ to $s^{-1}\omega_{n\mathbf{k}}$. Equivalently, g(s) maps $\mathrm{DOS}(\omega)$ to $s\,\mathrm{DOS}(s\omega)$ and thus leaves the y-labels of Eq. (3) invariant.

2. Lattice scaling ($G_{s'}$): Similarly, rescaling the lattice periodicity $a \to s'a$ rescales the eigenspectrum by $s'^{-1}$; since the labels of Eq. (3) are expressed in units of the natural frequency $\omega_0 \propto a^{-1}$, they are likewise left invariant.
Of $G_s$ and $G_{s'}$, only the amplitude scaling $G_s$ is pixel-compatible ($G_{s'}$ can be implemented as a tiling operation in the unit cell, which, however, requires down-sampling). Accordingly, we restrict our focus to the pixel-compatible invariances of the product group $G = G_0 \times G_t \times G_s$ and sample its elements randomly. In practice, the sampling frequency of each element in G is a hyperparameter of the pre-training stage (see Methods and SI Section S5).
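Sampling a random element of $G = G_0 \times G_t \times G_s$ on a pixelized cell might be sketched as below. The uniform sampling and the scaling range are illustrative assumptions; as noted above, the paper instead treats per-element sampling frequencies as hyperparameters:

```python
import numpy as np

def random_invariance(x, rng):
    """Apply a randomly sampled element of G = G0 x Gt x Gs to a
    pixelized unit cell of refractive index n(r). Uniform sampling over
    illustrative ranges; the paper tunes per-element sampling
    frequencies as hyperparameters.
    """
    # G0: the 4mm point group = 4 rotations x optional mirror (8 elements)
    x = np.rot90(x, k=int(rng.integers(4)))
    if rng.integers(2):
        x = np.flip(x, axis=0)
    # Gt: pixel-discrete translations (periodic wrap-around)
    x = np.roll(x, shift=rng.integers(x.shape[0], size=2), axis=(0, 1))
    # Gs: refractive amplitude scaling n -> s * n with s > 0
    return rng.uniform(0.5, 2.0) * x
```

Each operation permutes or uniformly rescales the pixel values, so the transformed cell carries the same label as the original, which is exactly what the positive pairs of the contrastive step require.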
PhC DOS prediction. To assess the trained network's performance in an easily interpretable setting, we define the evaluation error metric, following Ref. 46,

$$\mathcal{L}_{\mathrm{eval}} = \frac{\sum_\omega \big|\mathrm{DOS}_\Delta^{\mathrm{pred}}(\omega) - \mathrm{DOS}_\Delta(\omega)\big|}{\sum_\omega \mathrm{DOS}_\Delta(\omega)}, \qquad (4)$$

where $\mathrm{DOS}_\Delta = S_\Delta * \mathrm{DOS} = \omega_0^{-1}y + \mathrm{DOS}_{\mathrm{EL}}$ and $\mathrm{DOS}_\Delta^{\mathrm{pred}} = \omega_0^{-1}y^{\mathrm{pred}} + \mathrm{DOS}_{\mathrm{EL}}$ are the true and predicted $S_\Delta$-smoothened DOS, respectively, and the sums are over the spectral range $0.24 \leq \omega/\omega_0 \leq 0.96$ (we omit the spectral region $0 \leq \omega/\omega_0 < 0.24$ during evaluation to get a more critical metric, since the DOS has no significant features there). The network architecture and training details (loss functions, hyperparameters, layers, etc.) are discussed in the Methods section.
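Under the assumption that the evaluation metric is the relative summed absolute deviation described in the text, it can be computed directly:

```python
import numpy as np

def eval_error(dos_true, dos_pred, omega, omega0, lo=0.24, hi=0.96):
    """Relative summed absolute deviation between true and predicted
    smoothened DOS over 0.24 <= w/w0 <= 0.96. The summed-ratio form is
    assumed from the text's description of the metric."""
    mask = (omega / omega0 >= lo) & (omega / omega0 <= hi)
    return np.abs(dos_pred[mask] - dos_true[mask]).sum() / dos_true[mask].sum()
```

A prediction uniformly 10% above the truth yields an error of 0.1 under this metric, which matches the percentage-style error figures quoted below.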
The performance of SIB-CL under this error measure is evaluated in Fig. 4 and contrasted with the performance of the SL and TL baselines. In practice, to minimize fluctuations due to sample selection, we show the mean of $\mathcal{L}_{\mathrm{eval}}$ over three separate fine-tuning stages on distinct randomly selected datasets of size $N_t$, evaluated on a fixed test set.
A significant reduction of prediction error is observed for SIB-CL over the baselines, especially for few fine-tuning samples: e.g., at $N_t = 100$, SIB-CL has 4.6% error while SL (TL) has 7.6% (6.9%) error. More notably, we see a large reduction in the number of fine-tuning samples $N_t$ needed to achieve the same level of prediction error, which directly illustrates the data savings in the data-scarce problem. We obtain up to 13× (7×) savings in $N_t$ when compared to SL (TL) at a target prediction error of ∼5.3%. The predicted and true DOS are compared as functions of frequency in Fig. 4b across a range of error levels, as realized for different unit cell inputs.
SIB-CL was also observed to outperform the invariance-augmented baselines SL-I and TL-I (see SI Section S4), albeit by a smaller margin than the unaugmented baselines, consistent with expectations. Nevertheless, we observe a clear performance advantage of SIB-CL over TL-I, demonstrating the effectiveness and utility of the contrastive learning step in SIB-CL compared to naive augmented TL approaches.
To demonstrate that the effectiveness of SIB-CL extends beyond the DOS prediction problem, we also trained a network using SIB-CL and all baselines to predict the PhC band structure (see SI Section S2). For this task, the network labels y are $\omega_{n\mathbf{k}}/\omega_0$, sampled over a 25 × 25 k-grid and over the first 6 bands, i.e., $y \in \mathbb{R}_{\geq 0}^{6 \times 25 \times 25}$, while the inputs x remain unchanged. Unlike the DOS, the band structure is not invariant under the elements of $G_0$, but remains invariant under translations ($G_t$) and refractive amplitude scaling ($G_s$), i.e., $G = G_t \times G_s$. Also for this task, we found SIB-CL to enable significant data savings, ranging up to 60× relative to the SL baseline.

C. 3D Schrödinger Equation
As a test of the applicability of SIB-CL to higher-dimensional problems, we consider predicting the ground state energies of the single-particle, time-independent Schrödinger equation (TISE) for random three-dimensional (3D) potentials in a box. This problem demonstrates a proof-of-principle application to the domain of electronic structure calculations, which is of fundamental importance to the understanding of molecules and materials across physics, chemistry, and materials science.
Dataset generation. The eigenstates $\psi_n$ and eigenenergies $E_n$ of a (non-relativistic) single-particle electron in a potential $U(\mathbf{r})$ are the eigensolutions of the TISE:

$$\hat{H}\psi_n = E_n\psi_n, \qquad (5)$$

where $\hat{H} = \hat{T} + \hat{U}$ is the Hamiltonian consisting of kinetic, $\hat{T} = -\tfrac{1}{2}\nabla^2$, and potential energy, $\hat{U} = U(\mathbf{r})$, contributions. Here, and going forward, we work in Hartree atomic units (h.a.u.). For simplicity, we consider random potentials $U(\mathbf{r})$ confined to a cubic box of side length 10 Bohr radii ($a_0$), with values in the range [0, 1] Hartree (see Methods for details). Examples of the generated potentials are shown in Fig. 5a (left). We associate the network input-label pairs (x, y) with the potentials $U(\mathbf{r})$ (sampled over a 32 × 32 × 32 equidistant grid) and ground state energies $E_0$, respectively. We evaluate $E_0$ by using (central) finite differences with implicit Dirichlet boundary conditions to discretize Eq. (5), which is subsequently solved using an iterative sparse solver [52]. The target dataset $D_t$ is computed using a 32 × 32 × 32 finite-difference discretization, with an estimated mean numerical error of ≈0.1% (Fig. 5b, left).
In the previously considered PhC DOS problem, the surrogate dataset $D_s$ was built from a particularly simple input class with exact and inexpensive labels. Here, instead, we assemble $D_s$ by including the original range of inputs x but using approximate labels $\tilde{y}$. In particular, we populate the surrogate dataset with input-label pairs $(x, \tilde{y})$, with $\tilde{y} = \tilde{E}_0$ computed from a low-resolution 5 × 5 × 5 finite-difference discretization of $U(\mathbf{r})$ (Fig. 5b). $\tilde{E}_0$ has a relatively high error of ∼10% (Fig. 5b, right) but is orders of magnitude faster to compute: e.g., a naive power-iteration eigensolver requires $O(n^2)$ operations per iteration (with $n = N^3$ denoting the number of grid voxels and N the grid points per dimension), such that iterations at N = 5 require $\sim 10^5$-fold less work than at N = 32.
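A minimal version of such a finite-difference solver (central differences, implicit Dirichlet boundaries, iterative sparse eigensolver) might read as follows; the grid conventions are illustrative and the paper's solver may differ in detail:

```python
import numpy as np
from scipy.sparse import diags, identity, kron
from scipy.sparse.linalg import eigsh

def ground_state_energy(U, L=10.0):
    """Ground-state energy (Hartree atomic units) of the single-particle
    TISE on an N x N x N interior grid with implicit Dirichlet boundaries.

    U: (N, N, N) array of potential values on the interior grid points.
    A minimal sketch; grid conventions are illustrative choices.
    """
    U = np.asarray(U, dtype=float)
    N = U.shape[0]
    h = L / (N + 1)  # spacing; boundary nodes (where psi = 0) are excluded
    # 1D central-difference second derivative; 3D Laplacian via Kronecker sums
    d2 = diags([1.0, -2.0, 1.0], [-1, 0, 1], shape=(N, N)) / h**2
    I = identity(N)
    lap = kron(kron(d2, I), I) + kron(kron(I, d2), I) + kron(kron(I, I), d2)
    H = -0.5 * lap + diags(U.ravel())  # H = T + U with T = -(1/2) Laplacian
    # smallest algebraic eigenvalue via an iterative (Lanczos) sparse solver
    return eigsh(H, k=1, which="SA", return_eigenvectors=False)[0]
```

Running the same routine on a coarse 5 × 5 × 5 grid versus a 32 × 32 × 32 grid reproduces the cheap-surrogate versus target-label split described above.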
To assess the impact of the choice of surrogate data, we also examine an alternative surrogate dataset, with input-label pairs $(\tilde{x}, \tilde{y})$, derived from quantum harmonic oscillator (QHO) potentials

$$\tilde{U}(\mathbf{r}) = \tfrac{1}{2}\, \boldsymbol{\omega}^{\circ 2} \cdot (\mathbf{r} - \mathbf{r}_0)^{\circ 2},$$

where $(\mathbf{A}^{\circ n})_i = A_i^n$ is the Hadamard (element-wise) power operator. We define the associated surrogate labels by the open-boundary QHO energies, i.e., by $\tilde{y} = \tilde{E}_0 = \tfrac{1}{2}\sum_i \omega_i$, and assign the inputs $\tilde{x}$ by the in-box grid discretization of $\tilde{U}(\mathbf{r})$. The $\tilde{y}$ labels consequently reflect an example of analytically approximated labels (here, with approximation error due to the neglect of the Dirichlet boundary conditions); see SI Section S3. For quicker training of the network, we use the 2D version of the TISE with this surrogate dataset (i.e., $D_s$ and $D_t$ consist of 2D QHO potentials and 2D random potentials, respectively).
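Generating one QHO surrogate sample can be sketched as below; the potential parameterization (oscillator frequencies and center) and the box conventions are assumptions for illustration:

```python
import numpy as np

def qho_surrogate(omegas, center, L=10.0, n_pix=32):
    """One QHO surrogate sample: the in-box grid discretization of
    U(r) = 1/2 * sum_i w_i^2 (r_i - c_i)^2 together with the analytic
    open-boundary ground-state energy E0 = 1/2 * sum_i w_i (h.a.u.).
    A sketch; the exact parameterization is an assumption.
    """
    x = (np.arange(n_pix) + 0.5) * L / n_pix          # in-box grid coordinates
    grids = np.meshgrid(*([x] * len(omegas)), indexing="ij")
    U = sum(0.5 * w**2 * (g - c)**2 for w, g, c in zip(omegas, grids, center))
    E0 = 0.5 * sum(omegas)                            # neglects the box walls
    return U, E0
```

The label comes for free from the analytic formula, while the input remains a grid image of the same kind as the target data, which is what makes this surrogate near-zero cost.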
Ground-state energy prediction. The ground-state energy is invariant under the elements of the symmetry point group, i.e., $G = G_0$ in 2D. In 3D, we instead have the m3m point group, which notably has 48 elements (instead of just 8 in $G_0$). Figure 5c shows the results using the surrogate dataset of reduced-resolution data, compared against the baselines. We observe up to 40× data savings for SIB-CL when compared to SL; SIB-CL was also observed to outperform the augmented baselines, SL-I and TL-I (see SI Section S4). As a validation step, the prediction accuracies are noted to be on the order of ≈1%, making the surrogate (target) dataset with ≈10% (≈0.1%) numerical error an appropriate design choice for approximate (target) data. For the experiments using the QHO surrogate dataset, we obtain up to 4× savings when using SIB-CL compared to SL (see SI Section S3); the data savings are diminished, within expectations, since the QHO dataset is considerably simpler and contains less information to "transfer".

III. DISCUSSION
The widespread adoption and exploitation of data-driven techniques, most prominently deep learning via neural networks, in scientific problems has been fundamentally limited by a relative data scarcity. That is, data is only rarely available in the quantities required to train a network to faithfully reproduce the salient features of nontrivial scientific problems; moreover, even if such data can be generated, it typically requires excessive computational effort. Here, we have introduced SIB-CL, a novel framework that overcomes these fundamental challenges by incorporating prior knowledge and auxiliary information, including problem invariances, "cheap" problem classes, and approximate labels. With SIB-CL, the required quantities of costly, high-quality training data are substantially reduced, opening the application space of data-driven techniques to a broader class of scientific problems.

Figure 5. We use a finite-difference numerical method with Dirichlet boundary conditions to solve for the ground-state energies, E0. b) Left: generation of the reduced-resolution surrogate dataset, where the input unit cell resolution was reduced from 32 × 32 × 32 to 5 × 5 × 5 and the same numerical solver was used to obtain the approximate ground-state energy. Right: the relative numerical error of E0, referenced against N = 128 (taken to be the 'theoretical' value), shows that N = 5 (N = 32) is a good choice for the reduced (original) resolution since it gives a relatively high (low) error of 10% (0.1%). c) Network prediction error against fine-tuning dataset sizes Nt for SIB-CL and the baselines on the TISE problem. For SIB-CL, we show the best results among the two contrastive learning techniques, SimCLR [32] and BYOL [33]. SIB-CL was seen to give up to 40× (6×) data savings when compared to SL (TL).
We demonstrated the versatility and generality of SIB-CL by applying it to problems in photonics and electronic structure, namely to the prediction of the DOS and band structures of 2D PhCs and the ground state energies of the TISE. Through our experiments, we demonstrated that even very simple sources of auxiliary information can yield significant data savings. For instance, the group of invariances G can be just a set of simple rotations and mirrors as in the TISE problem. Similarly, there are diverse options for constructing the surrogate dataset: here, we explored the use of simplified structures where (semi-) analytical solutions exist (e.g., circular structures of PhCs), approximate calculations of the target problem (e.g., reduced resolution computations of TISE), and even a combination of the two (e.g., approximated energies of QHO potentials in the TISE problem). Most natural science disciplines, especially physics, have deep and versatile caches of such approximate and analytical approaches which can be drawn from in the creation of suitable surrogate datasets.
In the problems studied here, SIB-CL outperformed all baselines (including invariance-augmented baselines, see SI Section S4). We conjecture that SIB-CL's performance advantage stems predominantly from enforcing problem invariances via a contrastive loss, which is more effective than naive data augmentation (cf. the performance edge of SIB-CL over TL-I).
To substantiate this hypothesis, we performed several ablation experiments (see SI Section S5). Firstly, when all invariance information is disregarded in SIB-CL (i.e., if the group of invariances G is reduced to the trivial identity group), we observe very similar performance to TL. This demonstrates that the contrastive stage is only effective in combination with invariance information, or, equivalently, that the utility of the contrastive stage hinges strongly on attracting nontrivial positive pairs rather than merely repelling negative pairs.
Next, we explored the consequences of selectively and increasingly removing invariances from G. We found that including more invariances strictly improves SIB-CL's performance, consistent with expectations since the elements of G are true invariances of the downstream task. This is contrary to the standard self-supervised learning (SSL) paradigm which is task-agnostic, i.e., the downstream task is not known during the contrastive learning stage, and transformations may not be true invariances of the downstream problem so including more transformations can sometimes be harmful [32,38,53].
While contrastive learning has gained enormous popularity in recent years, its techniques have mainly found applications in computer vision tasks (e.g., image classification on ImageNet [10]), while their utility for regression problems has remained largely unexplored. Techniques like SimCLR are based on instance discrimination, i.e., the network is trained to discriminate between "negative" pairs in the batch. Intuitively, such techniques may seem less well suited to regression problems, where the latent space is often continuous rather than discrete or clustered as in classification problems. Indeed, we made several empirical observations that disagree with the findings of standard contrastive learning applications on classification problems. Notably, it is widely corroborated [32,54,55] that using a larger batch size is always more beneficial, which can be interpreted as the consequence of having more negative pairs for instance discrimination. This empirical finding was not echoed in our experiments, suggesting that instance discrimination may not be highly appropriate for regression problems. Motivated by this, we also explored the BYOL technique [33], which is not based on instance discrimination and does not use explicit negative pairs in its contrastive loss function (see SI Section S1), but found no performance advantage. Despite many empirical successes, contrastive learning remains poorly understood and lacks a solid theoretical explanation [38,56-58] for why and when these algorithms work well. Our work further underscores and motivates the need to develop such an improved foundation, not only to address the noted deviations from expectations but also to guide the emerging application of contrastive learning techniques to regression tasks.
Our work is complementary to the growing body of work on equivariant networks for various symmetry groups [59-62], particularly for applications in the natural sciences [63,64]. These works are mainly motivated by the fact that the exploitation of symmetry knowledge provides a strong inductive bias, or constraint on the space of possible models, allowing the network to achieve better predictive accuracy (equivalently, higher data efficiency at the same level of accuracy). A well-known illustration of this idea is the superiority of convolutional neural networks (CNNs) over simple fully connected (FC) networks for image tasks, arising from the translation equivariance of CNNs. Like equivariant networks, SIB-CL also aims to create a network that respects the underlying symmetries of the problem. However, rather than "hard-coding" invariance information into the model architecture, this is achieved organically via contrastive learning. The price paid for this more generic approach is that feature invariance to the symmetry group G is only approximately achieved-to a degree expressed indirectly by the NT-Xent loss (Eq. (1))-rather than exactly, as is the case for "hard-coded" problem-specific equivariant architectures. Conversely, SIB-CL has the advantage of being simple and readily generalizable to any known invariance, i.e., it requires no specialized kernels and can readily incorporate additional invariances without changes to the underlying architecture. Relatedly, SIB-CL's superior performance over TL-I similarly suggests that using contrastive learning to enforce invariances is likely to be more effective than naive data augmentation.
Our work provides new insights into how issues of data scarcity can be overcome by leveraging sources of auxiliary information in natural science problems. The SIB-CL framework presented here demonstrates how such auxiliary information can be readily and generically incorporated into the network training process. Our work also sheds light on the thus-far under-explored application of contrastive learning to regression tasks, opening up opportunities for applications in several domains dominated by regression problems, in particular the natural sciences.

A. Dataset generation
PhC unit cells. We parameterize ε(r) by choosing a level set of a Fourier sum function φ, defined as a linear sum of plane waves with wave vectors evenly spaced in reciprocal space (up to some cut-off), i.e.,

φ(r) = Re Σ_k c_k e^{2πi n_k·r}, (7)

where each n_k is a 2D vector (n_x, n_y) and we used 3 Fourier components per dimension, i.e., n_x, n_y ∈ {−1, 0, 1} (so the summation index k runs over 9 terms). Each c_k is a complex coefficient, c_k = r e^{iθ}, with r and θ separately sampled uniformly in [0, 1). Finally, we uniformly sample a filling fraction, defined as the fraction of area in the unit cell occupied by ε_1, in [0, 1) to determine the level set Δ, so as to obtain the permittivity profile

ε(r) = ε_1 if φ(r) ≤ Δ, and ε_2 otherwise. (8)

This procedure produces periodic unit cells with features of uniformly varying sizes, due to the uniform sampling of the filling fraction, and without strongly divergent feature scales, thus corresponding to fabricable designs.
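A minimal NumPy sketch of this sampling procedure might look as follows; the permittivity values eps1/eps2, the resolution, and the quantile-based choice of the level set Δ (to realize the sampled filling fraction) are illustrative assumptions of this sketch, and the released dataset code may differ:

```python
import numpy as np

def make_unit_cell(res=32, n_max=1, eps1=1.0, eps2=10.0, rng=None):
    """Two-tone unit cell from a random Fourier-sum level set (Eqs. (7)-(8))."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.linspace(0.0, 1.0, res, endpoint=False)     # lattice constant a = 1
    X, Y = np.meshgrid(x, x, indexing="ij")
    phi = np.zeros((res, res))
    for nx in range(-n_max, n_max + 1):                # 3 components per dimension
        for ny in range(-n_max, n_max + 1):            # -> 9 terms in total
            r, th = rng.uniform(0, 1, 2)               # c_k = r*exp(i*theta), r, theta in [0,1)
            phi += np.real(r * np.exp(1j * th) * np.exp(2j * np.pi * (nx * X + ny * Y)))
    f = rng.uniform(0, 1)                              # filling fraction of eps1
    delta = np.quantile(phi, f)                        # level set realizing that fraction
    return np.where(phi <= delta, eps1, eps2)
```

The quantile trick directly enforces the sampled filling fraction: the fraction of grid points with φ ≤ Δ is f by construction.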
PhC DOS processing. Using the MIT Photonic Bands (MPB) software [47], we used 25 × 25 plane waves (and a 25 × 25 k-point resolution) over the Brillouin zone −π/a < k_x,y ≤ π/a to compute the band structure of each unit cell up to the first 10 bands, and also extracted the group velocities at each k-point. We then computed the DOS for ω/ω_0 ∈ [0, 0.96] over 16 000 equidistantly spaced frequency samples using the generalized Gilat-Raubenheimer (GGR) method [45,46]. Next, we computed the S_Δ-smoothened DOS, i.e., DOS_Δ = S_Δ * DOS, using a Gaussian filter S_Δ(ω) = e^{−ω²/2Δ²}/(√(2π)Δ) of spectral width Δ = 0.006ω_0. Before defining the associated network labels y, we downsampled DOS_Δ to 400 frequency points. Finally, the network y-labels were constructed according to Eq. (3), i.e., by subtracting the "background" empty-lattice DOS, DOS_EL(ω) = a²n_avg²ω/(2πc²)-the DOS of a uniform unit cell Ω of index n_avg = (1/|Ω|)∫_Ω n(r) d²r-and rescaling by ω_0. Subtracting DOS_EL removes a high-frequency bias during training and was found to improve overall network accuracy.
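The smoothing, downsampling and empty-lattice-subtraction steps can be sketched as below; this is a schematic NumPy version in which frequencies are assumed to be in units of ω_0 (with a = c = 1, so the final rescaling by ω_0 is trivial), and the function name and argument defaults are ours:

```python
import numpy as np

def dos_labels(dos, omega, n_avg, a=1.0, c=1.0, delta=0.006, n_out=400):
    """Gaussian-smooth the raw DOS, downsample, subtract the empty-lattice DOS."""
    domega = omega[1] - omega[0]
    # Gaussian filter S_delta(w) = exp(-w^2 / 2 delta^2) / (sqrt(2 pi) delta)
    w = np.arange(-4 * delta, 4 * delta + domega, domega)
    kernel = np.exp(-w**2 / (2 * delta**2))
    kernel /= kernel.sum()                                   # discrete normalization
    dos_smooth = np.convolve(dos, kernel, mode="same")
    idx = np.linspace(0, len(omega) - 1, n_out).astype(int)  # downsample to n_out
    # empty-lattice DOS: DOS_EL(w) = a^2 * n_avg^2 * w / (2 pi c^2)
    dos_el = a**2 * n_avg**2 * omega[idx] / (2 * np.pi * c**2)
    return dos_smooth[idx] - dos_el
```

Note that the kernel is renormalized to sum to one so that the discrete convolution preserves the integrated DOS regardless of the frequency step.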
3D TISE unit cells. To generate samples of U(r), we follow the same procedure as in Eqs. (7) and (8) to first create two-tone potential profiles in 3D, i.e., r = (x, y, z) and n_k = (n_x, n_y, n_z) are now 3D vectors. We create finer features by increasing the number of Fourier components to n_x, n_y, n_z ∈ {−2, −1, 0, 1, 2} (so the summation in Eq. (7) now runs over 125 terms). We also modify the range of the potential: ε_1 in Eq. (8) is set to 0, while ε_2 is uniformly sampled in [0, 1]. The periodicity is removed by truncating 20% of the unit cell from each edge. A Gaussian filter with a kernel size 8% of the (new) unit cell is then applied to smooth the potential profile and, finally, the unit cells are discretized to a resolution of 32 × 32 × 32. This procedure is illustrated in SI Section S3 and is similarly used to produce the 2D unit cells, discretized to 32 × 32, when using the QHO surrogate dataset. The ratio between the length scale and the potential's dynamic range was also carefully selected to produce non-trivial wavefunctions, so as to create a meaningful deep learning problem (see SI Section S3 for further discussion).
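The full 3D pipeline can be sketched as follows; treating the "8% kernel size" as a Gaussian width of 8% of the truncated cell, the pre-truncation grid size, and the nearest-neighbour resampling are all assumptions of this sketch:

```python
import numpy as np

def make_tise_potential(res=32, n_max=2, crop=0.2, rng=None):
    """3D two-tone Fourier-sum potential: level set, truncation, smoothing."""
    rng = np.random.default_rng() if rng is None else rng
    full = int(np.ceil(res / (1 - 2 * crop)))        # generate larger, then crop
    x = np.linspace(0, 1, full, endpoint=False)
    X, Y, Z = np.meshgrid(x, x, x, indexing="ij")
    phi = np.zeros_like(X)
    for nx in range(-n_max, n_max + 1):              # 125 terms in total
        for ny in range(-n_max, n_max + 1):
            for nz in range(-n_max, n_max + 1):
                r, th = rng.uniform(0, 1, 2)
                phi += np.real(r * np.exp(1j * th)
                               * np.exp(2j * np.pi * (nx * X + ny * Y + nz * Z)))
    eps2 = rng.uniform(0, 1)                         # eps1 is fixed to 0
    U = np.where(phi <= np.quantile(phi, rng.uniform(0, 1)), 0.0, eps2)
    m = int(crop * full)                             # truncate 20% from each edge
    U = U[m:full - m, m:full - m, m:full - m]
    sigma = 0.08 * U.shape[0]                        # assumed meaning of "8% kernel"
    t = np.arange(-int(3 * sigma), int(3 * sigma) + 1)
    kernel = np.exp(-t**2 / (2 * sigma**2))
    kernel /= kernel.sum()
    for axis in range(3):                            # separable Gaussian smoothing
        U = np.apply_along_axis(np.convolve, axis, U, kernel, "same")
    idx = np.linspace(0, U.shape[0] - 1, res).astype(int)
    return U[np.ix_(idx, idx, idx)]                  # discretize to res^3
```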

B. Model Architecture
Our encoder network H consists of 3 to 4 convolutional neural network (CNN) layers followed by 2 fully connected (FC) layers, where the output of the CNNs is flattened before being fed into the FC layers. The channel dimensions in the CNN layers and the number of nodes in the FC layers vary between the different problems and are listed in Table I. For TISE, the CNN layers have 3D kernels to accommodate the 3D inputs, while the CNNs for the remaining problems use the regular 2D kernels of standard image tasks. For the predictor network G, we used 4 FC layers for all problems, with the number of nodes listed in Table II. The predictor network for the band structure problem consists of 6 blocks of the same layer architecture, one block per band, each separately updated using the loss from its band during training; a similar architecture was used in previous work [5]. We included BatchNorm [65], ReLU [66] activations and MaxPooling between the CNN layers, and ReLU activations between all FC layers in H and G. For the projector network J, we used 2 FC layers with hidden dimension 1024 and a ReLU activation between them; the final metric embeddings have dimension 256. J is fixed across all problems. Using the DOS prediction problem, we also experimented with deeper projector networks (i.e., increasing to 4 FC layers with the same hidden dimensions), as well as with including BatchNorm between the layers, and found small improvements. Invariance sampling during contrastive learning. In conventional contrastive learning applications in computer vision (CV), different instances of the input are often created via a pre-optimized, sequential application of various data augmentation strategies such as random cropping, color distortion and Gaussian blur [32,33].
Adopting this technique, we also apply transformations from each sub-group of G in a randomly determined order [G_t, G_0, G_s] and, additionally, experimented with various algorithms for performing contrastive learning; see SI Section S5. We found that introducing stochasticity in the application of the transformations is an effective strategy and thus use it in SIB-CL. More specifically, for each sub-group G_α, with α ∈ {0, t, s}, we set a probability p_α with which any non-identity transformation is applied. (Equivalently, inputs are left untransformed with probability 1 − p_α.) {p_α} is a set of hyperparameters that is often carefully optimized in standard CV applications (among other hyperparameters, such as the order and strength of the augmentations); here, for simplicity, we omitted this optimization step. We set p_α = 0.5 for all α's and sampled the elements uniformly, i.e., each transformation in G_α is applied with probability 0.5/m_α, where m_α is the total number of non-identity elements in G_α.
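This stochastic sampling scheme can be sketched in pure Python as follows; the group names and the callable-based interface are illustrative:

```python
import random

def sample_transform(groups, p=0.5, rng=random):
    """Stochastic invariance sampling: sub-groups are visited in a randomly
    determined order; each sub-group applies one of its m_alpha non-identity
    elements with probability p (chosen uniformly, so each element fires with
    probability p/m_alpha overall), and is skipped otherwise."""
    names = list(groups)
    rng.shuffle(names)                      # randomly determined sub-group order
    ops = [rng.choice(groups[n]) for n in names if groups[n] and rng.random() < p]
    def apply(x):
        for op in ops:
            x = op(x)
        return x
    return apply
```

In training, `apply` would be called twice per input (with independently sampled `ops`) to produce the two views of a positive pair.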
PhC DOS prediction loss functions. In step b of the pre-training stage, where we trained using a supervised learning loss on D_s (Fig. 2b), we used the pre-training loss function L_PT = ⟨ |y_pred − y| / (|y| + 1) ⟩_{ω/ω_0} for each sample in the batch, where y_pred and y are the network prediction and the true label of that sample respectively, | · | gives the element-wise absolute value, and ⟨ · ⟩ denotes the mean over the (normalized) frequency axis ω/ω_0, yielding a scalar for L_PT. This loss function was used during pre-training (for SIB-CL and the TL baselines); its purpose is to encourage the network to learn the general features of the DOS spectrum from the surrogate dataset while under-emphasizing the loss where the DOS diverges, i.e., at the Van Hove singularities. In our experiments, we found that L_PT indeed gave better prediction accuracies than the standard L1 or mean squared error (MSE) loss functions. After the pre-training step, the standard L1 loss function was used during fine-tuning on D_t (Fig. 2c) for SIB-CL and all the baselines.
PhC band structure prediction loss functions. During supervised training (for both pre-training and fine-tuning), we use the MSE loss function; for evaluation, we use a relative error measure (for easier interpretation) given by L_eval = ⟨ |ω_n^pred(k) − ω_n(k)| / ω_n(k) ⟩_{n,k}, where ω_n(k) are the eigenfrequencies indexed over band numbers n = 1, 2, ..., 6 and k are the wave vectors restricted to the Brillouin zone, i.e., −π/a < k_x,y ≤ π/a. The evaluation loss is thus the mean over all 6 bands and over all (25 × 25) k-points.
Ground-state energy prediction loss functions. The MSE loss function is similarly used during both the pre-training and fine-tuning stages of supervised training for the ground-state energy prediction problem. During evaluation, we use a simple relative error measure, |y_pred − y| / y, where y_pred is the network prediction and y = E_0 is the recorded ground-state energy, for each sample in the test set.
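Both evaluation metrics are simple to state explicitly; a NumPy sketch (the array shapes match the 6 bands and 25 × 25 k-grid quoted above, and the function names are ours):

```python
import numpy as np

def band_rel_error(omega_pred, omega_true):
    """Band-structure evaluation metric: relative error |dw|/w averaged over
    all 6 bands and all 25 x 25 k-points (arrays of shape (6, 25, 25))."""
    return float(np.mean(np.abs(omega_pred - omega_true) / omega_true))

def energy_rel_error(y_pred, y):
    """Per-sample relative error for the ground-state energy prediction."""
    return abs(y_pred - y) / abs(y)
```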
Training hyperparameters. For training the networks in all problems, we used Adam optimizers [67], with learning rates for the different steps specified in Table III. We also use an adaptive learning rate scheduler for the fine-tuning stage. Even though standard contrastive learning methods implement a cosine annealing scheduler [68], we found that this was not beneficial for SIB-CL on our problems and thus omitted it. Additionally, in order to prevent networks H and G from overfitting to the surrogate dataset, we explored various conventional regularization techniques during the pre-training stage, such as weight decay and dropout. We found that these were not beneficial; instead, we used early stopping where we saved the pre-trained model at various epochs and performed the fine-tuning stage on all of them, picking only the best results to use as the final performance. For SIB-CL, the pre-trained model was saved at {100, 200, 400} epochs, and for TL (both with and without invariances), the pre-trained model was saved at {40, 100, 200} epochs. Finally, another important hyperparameter in our experiments is the kernel size (n k ) of the CNN layers; apart from optimizing the learning process, this hyperparameter can be used to adjust the network size. This is important in our experiments since we are training/fine-tuning on varying sizes N t of the target dataset; a smaller (bigger) dataset is likely to need a smaller (bigger) network for optimal results. For the DOS prediction, we varied n k ∈ {5, 7}; for band structures, n k ∈ {7, 9, 11} and for TISE, n k ∈ {5, 7}. The same set of n k was applied for both SIB-CL and all baselines in every problem. Apart from those mentioned here, SIB-CL involves many other hyperparameters not explored here; see additional comments in SI Section S6.

DATA & CODE AVAILABILITY
The neural networks were implemented and trained using the PyTorch framework [69]. PhC band structures were computed using MPB [47]. Numerical solution of the TISE ground-state energies was implemented in Python using SciPy [70]. DOS calculations were carried out using the GGR method, adapted from the MATLAB implementation in Ref. 46. All source codes used for dataset generation and network training are publicly available at https://github.com/clott3/SIB-CL.

ACKNOWLEDGEMENTS
We thank Peter Lu, Andrew Ma, Ileana Rugina, Hugo Larochelle, and Li Jing for fruitful discussions. We acknowledge the MIT SuperCloud and Lincoln Laboratory Supercomputing Center for providing HPC resources that have contributed to the research results reported here. This work was sponsored in part by the United States Air Force Research Laboratory and the United States Air Force Artificial Intelligence Accelerator and was accomplished under Cooperative Agreement Number FA8750-19-2-1000. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the United States Air Force or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation herein. This work was also sponsored in part by the National Science Foundation under Cooperative Agreement PHY-2019786 (The NSF AI Institute for Artificial Intelligence and Fundamental Interactions).

S1. BYOL ILLUSTRATION & RESULTS
In BYOL [S2], instead of a single network as in SimCLR [S1], there are two networks of the same architecture (Supplementary Figure 1a): a trainable online network (with weights parameterized by θ) and a fixed target network (with weights parameterized by ξ). The two variations of a unit cell (obtained via sampling from G) are separately fed through the online and target networks, where the online network features an additional "BYOL predictor" sub-network K which is not present in SimCLR. The weights θ of the online network are then updated by minimizing the mean squared error (MSE) between the (normalized) online and target embeddings,

L_BYOL = ∥ K(z_θ)/∥K(z_θ)∥ − z_ξ/∥z_ξ∥ ∥², (S1)

where ∥ · ∥ indicates the L2 norm and z_θ, z_ξ are the respective embeddings illustrated in Supplementary Figure 1a. The target network, in turn, is updated at each training step using an exponential moving average of the online network weights,

ξ ← τξ + (1 − τ)θ, (S2)

where τ ∈ [0, 1] is a target decay rate hyperparameter. Unlike SimCLR's loss function (Eq. (1) in the main text), where negative pairs are explicitly "pushed apart" by the denominator of the cross-entropy loss function, the BYOL loss (Eq. (S1)) does not explicitly involve negative pairs.
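The two BYOL update rules can be sketched compactly in NumPy; representing the weights as a dict of scalars/arrays is an illustrative simplification:

```python
import numpy as np

def byol_loss(q, z_target):
    """BYOL loss (Eq. (S1)): MSE between the L2-normalized online prediction
    q = K(z_theta) and the target embedding z_xi. For nonzero vectors this
    equals 2 - 2*cos_sim(q, z_target)."""
    qn = q / np.linalg.norm(q)
    zn = z_target / np.linalg.norm(z_target)
    return float(np.sum((qn - zn) ** 2))

def ema_update(theta, xi, tau=0.99):
    """Exponential-moving-average target update, xi <- tau*xi + (1 - tau)*theta,
    applied weight-by-weight."""
    return {k: tau * xi[k] + (1 - tau) * theta[k] for k in xi}
```

Because the loss depends only on the cosine similarity of the pair, perfectly aligned embeddings give zero loss and anti-aligned embeddings give the maximum value of 4.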
In Supplementary Figure 1b, we compare the results when replacing the SimCLR technique in SIB-CL with the BYOL technique, evaluated on the DOS prediction problem for photonic crystals (PhCs); we also depict the baselines' performances for comparison. When including the full suite of invariance information (i.e., translations (t), rotations (C), mirrors (σ) and scaling (s)), BYOL was observed to give slightly poorer performance than SimCLR. Here, for convenience, the rotation and mirror operations are collectively denoted C = {C_2, C_4^±} and σ = {σ_h, σ_v, σ_d, σ_d'} respectively. When the scaling transformation was removed, BYOL gave results similar to SimCLR (and similar to BYOL with the scaling transformation included). This suggests that the use of negative pairs during contrastive learning has no significant implications, despite the initial intuition that such a concept may be inappropriate for regression problems. Additionally, our results suggest that BYOL was observably not effective at handling the scaling transformation. This observation was further echoed in the TISE ground-state energy prediction problem (Supplementary Figure 1c); there, the performance differences between the two techniques were observed to mostly occur within the 1σ uncertainty range, i.e., neither technique outperforms the other with statistical significance.

[Supplementary Figure 1. a) … same as Fig. 2 in the main text, except here we include explicit reference to the network weights using θ and ξ. b) Results for using BYOL in SIB-CL compared against using SimCLR on the PhC DOS prediction problem, for various combinations of invariances: translations (t), rotations (C), mirrors (σ), and scaling (s). c) Results for using BYOL in SIB-CL compared against using SimCLR on the ground-state energy prediction problem (in 3D, using low-resolution data as the surrogate dataset).]

S2. PHOTONIC CRYSTALS BAND STRUCTURE PREDICTION
In this section, we describe additional experiments performing PhC band structure regression, where we trained a network to predict the transverse-magnetic (TM) band structures of the PhCs, up to the first 6 bands. Unlike the DOS, we do not integrate over the Brillouin zone, and thus the set of invariant transformations is reduced to just the choice of the unit cell origin (i.e., to translation invariance, G = G_t). During supervised training (for both pre-training and fine-tuning), we use the mean squared error (MSE) loss function; for evaluation, we use a relative error measure (for easier interpretation) given by

L_eval = ⟨ |ω_n^pred(k) − ω_n(k)| / ω_n(k) ⟩_{n,k}, (S3)

where ω_n(k) are the eigenfrequencies indexed over band numbers n = 1, 2, ..., 6 and k are the wave vectors restricted to the Brillouin zone, i.e., −π/a < k_x,y ≤ π/a. The evaluation loss is taken as the mean over all 6 bands and over all k-points. Details of the network architecture and training hyperparameters used in these experiments are described in the Methods section of the main text.
The results are shown in Supplementary Figure 2a, where SIB-CL was observed to give more than 60× data savings compared to any of the baselines.

[Supplementary Figure 2. a) Results of SIB-CL (… for the contrastive learning step), compared against the baselines introduced in the main text, for band structure prediction. Error bars show the 1σ uncertainty level estimated from varying the data selection of the fine-tuning dataset. b) Examples of the band structure predicted by the SIB-CL-trained network (indicated by markers) compared against the actual band structure (indicated by surface plots) at 1.5% and 5% error levels; insets depict the associated unit cells.]

Examples of the band structures predicted by the trained network at various error levels are also depicted in Supplementary Figure 2b, where the markers indicate the network prediction and the surface plots depict the true band structure. While the data savings are highly impressive, we found the choice of evaluation metric to be more subtle than for the DOS prediction task. This is because most of the unique features in the band structure of each unit cell are contained only at some k-points, while the band structure at the remaining k-points can be very well approximated by the "empty lattice" (a featureless unit cell filled entirely by the average permittivity); we therefore undesirably under-emphasize the prediction error when taking the mean over k. This is further illustrated in Supplementary Table 1, where we show the relative numerical error (according to the metric in Eq. (S3)) of the band structures when computed using a reduced resolution (i.e., approaching the "empty lattice") of the unit cell. At a resolution of 4 × 4, the error amounts to only 3.5%, suggesting that the "empty-lattice" band structure is a rather good approximation. In other words, to critically evaluate the band structure prediction problem, an asymmetric loss function (that emphasizes the differences at the "special" k-points) should be used. Nevertheless, Supplementary Figure 2a serves as clear evidence that SIB-CL outperforms the baselines (since we use the same evaluation metric for SIB-CL and all baselines). In contrast, the DOS does not suffer from this issue (Supplementary Table 1), and thus a straightforward evaluation metric is acceptable, as presented in the main text.

S3. TISE: ADDITIONAL DETAILS AND RESULTS
The process of generating the random potentials U(r) for the time-independent Schrödinger equation (TISE) problem, as explained in the Methods section of the main text, is illustrated in Supplementary Figure 3a (here in 2D for illustration purposes). When generating the dataset, special care was taken in choosing the dynamic range of the potential and the physical size of the infinitely-bounded box to ensure that we produce non-trivial solutions. A dataset with trivial solutions would be one where the samples' wavefunctions either 1) resemble the ground-state solution of a "particle in a box", i.e., the variations in the potential are too weak relative to the confinement energy of the box, or 2) are highly localized in local minima of the potential, i.e., the variations in the potential are too strong relative to the confinement energy. Both extremes would not create a meaningful deep learning problem, and we thus avoid them by carefully controlling the ratio between the potential range and the length scale. In Supplementary Figure 3b, we display some examples of potentials in our dataset and their corresponding ground-state solutions, and verify that they indeed avoid the two aforementioned extremes. We show them in 2D here for illustration purposes, whereas in the main text the TISE problem is in 3D.
Apart from using reduced-resolution data as the surrogate dataset for the TISE problem, as presented in the main text, we also experimented with using a simple, analytically-defined surrogate: a dataset of quantum harmonic oscillators (QHOs). Here, potentials are generated using different values of ω = (ω_x, ω_y) and c = (c_x, c_y) in

Ũ(r) = ½[ω_x²(x − c_x)² + ω_y²(y − c_y)²],

for which the ground-state energy is analytically given by Ẽ_0 = ½(ω_x + ω_y) (see Supplementary Figure 3c). We experimented with the dynamic ranges of ω and c to ensure that the potentials of the unit cells are sufficiently large near the infinite boundaries, since our prediction task assumes Dirichlet boundary conditions. To do so, we solved for the ground-state energies of a small sample set of QHO potentials using our eigensolver (with Dirichlet boundary conditions) and chose dynamic ranges of ω and c such that the deviation of the analytical solution from the numerical solution is small. The final values we chose are ω_x, ω_y ∈ [0.3, 3.2] and c_x, c_y ∈ [0, 4.5], which gave an acceptable error of around 2.6% (defined according to the same loss metric used for network evaluation).
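Generating one such analytic surrogate sample can be sketched as follows; the box coordinates x, y ∈ [0, 4.5] (chosen to match the range of the centre offsets c) and the grid resolution are assumptions of this sketch, and units are taken as ħ = m = 1:

```python
import numpy as np

def qho_sample(res=32, rng=None):
    """One analytic QHO surrogate sample: a shifted 2D harmonic potential
    with analytic ground-state energy E0 = (wx + wy)/2."""
    rng = np.random.default_rng() if rng is None else rng
    wx, wy = rng.uniform(0.3, 3.2, 2)          # frequency ranges quoted above
    cx, cy = rng.uniform(0.0, 4.5, 2)          # centre-offset ranges quoted above
    x = np.linspace(0.0, 4.5, res)             # assumed box extent
    X, Y = np.meshgrid(x, x, indexing="ij")
    U = 0.5 * (wx**2 * (X - cx)**2 + wy**2 * (Y - cy)**2)
    return U, 0.5 * (wx + wy)
```

Since the label is available in closed form, an arbitrarily large surrogate dataset can be produced at essentially zero cost.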
Results for using QHO potentials with analytically-defined solutions as the surrogate dataset are shown in Supplementary Figure 3d, where SIB-CL is also seen to outperform the baselines. However, the performance margin over the baselines, as well as between the two baselines, is smaller than in the other problems; this is likely due to the simplicity of the QHO surrogate dataset.

S4. ADDITIONAL BASELINES: INCLUDING INVARIANCE INFORMATION USING DATA AUGMENTATION
The baselines introduced in the main text, SL and TL, do not invoke any invariance information during network training. In Supplementary Figure 4, we explore adding invariance information to these baselines via a simple data augmentation approach; we refer to the resulting baselines as SL-I and TL-I, for invariance-augmented SL and TL respectively. Specifically, the training procedure of SL-I corresponds exactly to Fig. 2c in the main text, i.e., each sample in the fine-tuning dataset undergoes a transformation randomly sampled from G before entering the encoder network H. Similarly, TL-I corresponds exactly to Fig. 2b-c, where samples from the surrogate and target datasets are each transformed with a random element from G before entering the encoder H during both the pre-training and fine-tuning stages. In all four problems, we found that SIB-CL outperforms all the baselines, some by significant margins.

S5. "BAKING IN" INVARIANCES USING CONTRASTIVE LEARNING: ABLATION STUDIES AND FURTHER EXPERIMENTS
Using the notation of the main text, the group of invariant transformations used for the PhC DOS prediction problem can be expressed as G = {1, t, C, σ, s}, where C = {C_2, C_4^±} and σ = {σ_h, σ_v, σ_d, σ_d'} denote the sets of all rotation and mirror operations respectively. In Supplementary Figure 5a, we present results from an ablation experiment on SIB-CL, where we selectively, and increasingly, remove transformations from G. We observe SIB-CL's prediction accuracy to monotonically worsen (improve) as transformations are removed from (added to) G. This is expected, since all of these transformations are true invariances of the (DOS prediction) problem, and thus including more types of transformations is equivalent to incorporating more prior knowledge of the problem, which should improve model accuracy. This observation is typically not echoed in standard contrastive learning applications in computer vision; a possible reason is that the data augmentation strategies used to generate the transformations there are not necessarily true invariances, since the downstream task is not known during contrastive learning, and thus some transformations may prove ineffective [S5]. This observation also validates the effectiveness of using contrastive learning to invoke prior knowledge of physical invariances.
Next, we also performed an ablation experiment where we discarded all invariance information from the contrastive learning step, for both SimCLR and BYOL. To do so, we reduced G to the trivial identity group, G = {1}. The results are shown in Supplementary Figure 5b, where we observe that both SimCLR and BYOL gave prediction accuracies that coincided very well with the TL baseline (which includes only information from the surrogate dataset, not invariances). This further validates SIB-CL's approach of using contrastive learning to invoke knowledge of invariances: once this knowledge is removed, performing contrastive learning as a pre-training step provides no benefit. Equivalently, this suggests that the utility of SIB-CL derives from attracting non-trivial positive pairs (i.e., learning invariances) rather than from repelling negative pairs.
Another insightful ablation experiment is to retain all elements of G during pre-training but reduce G to the trivial identity group during fine-tuning. This experiment can be performed for SIB-CL and TL-I (but not SL-I, since doing so would give exactly SL) and tells us how effectively each method has "baked" invariance information into the representations during the pre-training stage. We refer to these experiments as SIB-CL-rt and TL-I-rt respectively. From Supplementary Figure 5c, the high correspondence observed between SIB-CL and SIB-CL-rt suggests that SIB-CL learns, and retains, most of the invariance information from the pre-training stage. On the contrary, the gap between TL-I and TL-I-rt suggests that while some invariance information is learnt during pre-training (indicated by the gap between TL and TL-I-rt), much of its "invariance learning" also occurs during the fine-tuning stage. This suggests that, compared to TL-I, the representations learnt by SIB-CL have higher semantic quality; this is highly desirable, for example, in domains where there exists a series of different tasks (obeying the same invariances), since these representations can simply be reused without re-training the network.
Finally, we also studied three different algorithms for performing contrastive learning on these invariances, described in detail as follows:

1. Standard: transformations from G are uniformly sampled from the sub-groups in the following order: [G_t, G_0, G_s], and applied sequentially to each input. This sampling procedure gives rise to instances with "highly compounded" invariances; in other words, the two variations of the unit cell (a positive pair) can look drastically different.
2. Independent: contrastive learning is performed independently for each sub-group of transformations and their losses are summed; i.e., we create a positive pair for each sub-group, compute the NT-Xent loss for each pair, and sum them to get the total contrastive loss. This algorithm is slower than the rest (though this can be mitigated via parallel computing), since the number of forward passes per iteration grows with the number of sub-groups. However, this sampling procedure creates positive instances that differ only by a single sub-group operation, and hence their differences are less drastic.
3. Stochastic: similar to standard, except that each transformation sampled from G_α is applied only with some probability p_α. In our experiments, we set p_α = 0.5 for all α's (i.e., α ∈ {0, t, s}); in other words, there is a 50% chance that each transformation is not applied.
These 3 algorithms are compared in Supplementary Figure 5d, where the stochastic approach is seen to produce the best model accuracies. While the differences are small, the stochastic method is still strictly better than the other two for various combinations of transformations (Supplementary Figure 5d, left & right), and statistically significantly so at larger N_t. We conjecture that this is because the standard method uses instances with highly compounded invariances, which are difficult to learn, whereas the independent method uses simple invariances governed by single transformations, which are easily learnt but omit useful knowledge of compounded invariances. The stochastic method presents an ideal middle ground between the two. In practice, the individual p_α for each sub-group of transformations are hyperparameters that should be tuned for optimal performance; a detailed tuning of the individual p_α will likely widen the performance edge of the stochastic method.

S6. ADDITIONAL COMMENTS ON HYPERPARAMETERS
Apart from the hyperparameters discussed in the Methods section of the main text, SIB-CL involves many other possible training hyperparameters which were not studied in this work. For instance, we simply alternated contrastive learning and predictor pre-training (i.e., Fig. 2a-b in the main text) after every epoch; however, one could in principle vary the interval between the two steps, perform the two steps entirely sequentially, or even consider joint training (i.e., updating the weights of H using a weighted average of the losses from both steps). The initial motivation for training the two steps alternately (instead of sequentially) was to provide some information about the predictive task to H during contrastive learning. This is a key deviation of SIB-CL from standard self-supervised learning (SSL) techniques, which are completely task-agnostic. We conjecture that SIB-CL will be more effective for predictive modelling problems in science, where the labels are often more sophisticated (and the label space often larger) than the image classes used in standard SSL vision tasks. In standard SSL, a single linear layer (as opposed to the dense non-linear predictor network used in our case) is often applied after the encoder and deemed sufficient for the classification task; this is a common evaluation procedure in SSL [S3; S4; S7], better known as the "linear protocol".
Additionally, as suggested in Section S5, the stochastic sampling parameters for the different sub-groups of invariances (i.e., p_α) can also be tuned separately (here, we set them all to 0.5 for simplicity). Tuning the frequency and strength of transformations was found to be highly important for contrastive learning in standard computer vision applications [S5; S6], and these parameters are thus often exhaustively varied during optimization [S1]. If the variations discussed here were studied, performances better than those presented in this work might be attainable with SIB-CL.