Abstract
Deep learning techniques have been increasingly applied to the natural sciences, e.g., for property prediction and optimization or material discovery. A fundamental ingredient of such approaches is the vast quantity of labeled data needed to train the model. This poses severe challenges in data-scarce settings where obtaining labels requires substantial computational or labor resources. Noting that problems in natural sciences often benefit from easily obtainable auxiliary information sources, we introduce surrogate- and invariance-boosted contrastive learning (SIBCL), a deep learning framework which incorporates three inexpensive and easily obtainable auxiliary information sources to overcome data scarcity. Specifically, these are: abundant unlabeled data, prior knowledge of symmetries or invariances, and surrogate data obtained at near-zero cost. We demonstrate SIBCL’s effectiveness and generality on various scientific problems, e.g., predicting the density-of-states of 2D photonic crystals and solving the 3D time-independent Schrödinger equation. SIBCL consistently results in orders-of-magnitude reductions in the number of labels needed to achieve the same network accuracies.
Introduction
In recent years, there has been increasing interest and rapid advances in applying data-driven approaches, in particular, deep learning via neural networks, to problems in the natural sciences^{1,2,3,4}. Unlike traditional physics-informed approaches, deep learning relies on extensive amounts of data to quantitatively discover hidden patterns and correlations to perform tasks such as predictive modeling^{4,5}, property optimization^{6,7}, and knowledge discovery^{8,9}. Its success is thus largely contingent on the amount of data available, and a lack of sufficient data can severely impair model accuracy. Historically, deep learning applications have overcome this by brute force, e.g., by assembling vast curated data sets by crowd-sourced annotation or from historical records. Prominent examples include ImageNet^{10} and CIFAR^{11} in computer vision (CV) and WordNet^{12} in natural language processing (NLP) applications. The majority of problems in the natural sciences, however, are far less amenable to this brute-force approach, partly reflecting a comparative lack of historical data, and partly the comparatively high resource cost (e.g., time or labor) of synthesizing new experimental or computational data.
A popular approach to alleviate the reliance on labeled data is transfer learning (TL)^{13,14,15,16,17}, which refers to the strategy of fine-tuning, for a target task, a neural network that has been pre-trained on a large labeled source dataset. TL has been explored and proven effective in various works within the natural sciences^{18,19,20,21}; however, most works make use of source data from a different problem^{18,19} or domain^{20,21}, thus limiting the efficacy of TL due to the dissimilarity between the source and target problems^{22,23}. In this work, we overcome this limitation by using a prominent feature unique to problems in the natural sciences: they often benefit from exact and approximate analytical techniques or general insights requiring minimal or no computational cost. While this concept was previously explored by Zhang and Ling^{24}, their method uses kernel ridge regression (KRR) and thus cannot take advantage of inductive biases commonly used in deep learning without specialized kernels.
More recently, an increasingly popular technique is that of self-supervised learning (SSL)^{25,26}, which primarily differs from TL in that the pre-training stage uses unlabeled rather than labeled data. Specifically, pretext tasks like image rotation prediction^{27} and jigsaw puzzle solving^{28} are invented for the data to provide its own supervision. In particular, contrastive SSL^{26} (or contrastive learning) is an increasingly popular technique where the pretext task is constructed as contrasting between two variations of a sample and other samples, where variations are derived using image transformations. The goal is for the pre-trained model to output embeddings where similar (differing) instances are closer (further) in the embedding metric space. In this work, we leverage contrastive learning to invoke symmetries in the problem.
Exploiting physical insights and symmetries has been a highly effective strategy in scientific machine learning. For example, in molecular sciences, symmetry knowledge is often invoked via hand-crafted features^{29} or using deep tensor neural networks^{30,31,32,33} with components analytically formulated to respect physical laws and create chemistry-related inductive biases. In these works, invariance is achieved either through the parameterization of the inputs^{30,31} or due to intrinsic symmetry preservation of the architecture itself^{32,33}. These approaches are, however, highly domain-specific since the architecture is either hand-crafted or analytically formulated specifically for atomistic systems. Equivariant neural networks^{34,35,36,37} provide a domain-agnostic alternative to exploit symmetries; they have been generalized to spherical images^{38} and volumetric data^{37}, and have been effectively applied to the natural sciences^{39,40} as well. However, the design of such architectures still involves deep technical expertise to mathematically construct the symmetry-preserving specialized kernels. In contrast, contrastive learning provides a domain- and model-agnostic approach to exploit symmetries, where a black-box approach is used to embed physical knowledge instead of through the network architecture or input parameterization, bearing some similarities to conventional data augmentation^{41,42,43,44} strategies. While some applications have been explored in various works^{45,46,47,48}, most are confined within graph architectures; applications to the natural sciences have been scarce, partly owing to the intricacy of designing transformation strategies^{49} suitable for scientific problems.
Here, we introduce Surrogate- and Invariance-boosted Contrastive Learning (SIBCL), a deep learning framework based on the unique disposition of problems in natural sciences, where auxiliary information sources are often accessible a priori or can be obtained by inexpensive means (see Fig. 1). Specifically, these are: (1) abundant unlabeled data; (2) prior knowledge in the form of invariances of the physical problem, which can be governed by geometric symmetries of the inputs or general non-symmetry-related invariances of the problem; (3) a surrogate dataset on a similar problem that is cheaper to generate, e.g., by invoking simplifications or approximations to the labeling process. SIBCL uses popular deep learning techniques of TL and SSL as effective and broadly applicable strategies to incorporate these auxiliary information sources, enabling effective and high-quality network training despite data scarcity. SIBCL is applicable to domains where a related and simplified surrogate dataset can be created, common to scientific disciplines where approximate or analytical approaches^{50,51,52} are prevalent. Examples of such domains include the use of classical interatomic force fields in molecular dynamics^{51} and the hierarchy of approximations in density-functional theory (DFT) captured by multiple distinct rungs of Jacob’s ladder^{52}. Here, SIBCL’s effectiveness will be demonstrated in various problems in the natural sciences, in particular, on two systems in the fields of photonics and quantum physics calculations.
Results
Surrogate- and invariance-boosted contrastive learning (SIBCL)
We seek to train a neural network to predict desired properties (or labels) y from input x using minimal training data. More precisely, for a target problem \(D_{\rm t}=\{\mathbf{x}_i,\mathbf{y}_i\}_{i=1}^{N_{\rm t}}\) consisting of N_{t} input–label pairs, we focus on problem spaces where N_{t} is too small to successfully train the associated network. To overcome this, we introduce two auxiliary data sets: (1) a set of zero-cost unlabeled inputs \(D_{\rm u}=\{\mathbf{x}_i\}_{i=1}^{N_{\rm u}}\) and (2) a surrogate data set \(D_{\rm s}=\{\tilde{\mathbf{x}}_i,\tilde{\mathbf{y}}_i\}_{i=1}^{N_{\rm s}}\) consisting of inexpensively computed labels \(\tilde{\mathbf{y}}_i\) (e.g., from approximation or semi-analytical models) with associated input \(\tilde{\mathbf{x}}_i\) (possibly, but not necessarily, a “simple” subset of all inputs). The size of each of these auxiliary data sets is assumed to far exceed that of the target data set, i.e., {N_{u}, N_{s}} ≫ N_{t} (and, typically, N_{u} > N_{s}).
On the basis of these auxiliary datasets, we introduce our framework, Surrogate- and Invariance-Boosted Contrastive Learning (SIBCL), which significantly reduces the data requirements on D_{t} (Fig. 2). SIBCL achieves this by splitting the training process into two stages: a first, interleaved two-step pre-training stage using the auxiliary data sets D_{u} and D_{s} (Fig. 2a, b), followed by a fine-tuning stage using the target data set D_{t} (Fig. 2c).
In the first step of the pre-training stage (Fig. 2a), we exploit contrastive learning to learn invariances in the problem space using unlabeled inputs aggregated from the target and surrogate data sets, \(D_{\rm CL}=\{\mathbf{x}_i\}_{i=1}^{N_{\rm u}}\cup\{\tilde{\mathbf{x}}_i\}_{i=1}^{N_{\rm s}}\). We complement D_{CL} by a set of known, physics-informed invariance relations {g} (which we formally associate with elements of a group \(\mathcal{G}\)) which map input–label pairs (x_{i}, y_{i}) to (gx_{i}, y_{i}), i.e., to new inputs with identical labels. We base this step on the SimCLR technique^{53}, though we also explore using the BYOL technique^{54} later (see “Discussion” and SI section S1). Specifically, for each input x_{i} in D_{CL} (sampled in batches of size B), two derived variations gx_{i} and \(g'\mathbf{x}_i\) are created by sampling two concrete mappings g and \(g'\) from the group of invariance relations \(\mathcal{G}\) (see Methods). The resultant 2B inputs are then fed into an encoder network H and then a projector network J, producing metric embeddings \(\mathbf{z}_{i^{(\prime)}}=\mathbf{J}(\mathbf{H}(g^{(\prime)}\mathbf{x}_i))\). A positive pair \(\{\mathbf{z}_i,\mathbf{z}_{i'}\}\) is the pair of metric embeddings derived from the two variations of x_{i}, i.e., gx_{i} and \(g'\mathbf{x}_i\); all other pairings in the batch are considered negative. At each training step, the weights of H and J are simultaneously updated according to a contrastive loss function defined by the normalized temperature-scaled cross entropy (NT-Xent) loss^{53}:
\[\mathcal{L}_{ii'}=-\log\frac{\exp(s_{ii'}/\tau)}{\sum_{j=1}^{2B}[i\neq j]\exp(s_{ij}/\tau)}\tag{1}\]
where \(s_{ii'}=\hat{\mathbf{z}}_i\cdot\hat{\mathbf{z}}_{i'}\) (and \(\hat{\mathbf{z}}_i=\mathbf{z}_i/\Vert\mathbf{z}_i\Vert\)) denotes the cosine similarity between two normalized metric embeddings \(\hat{\mathbf{z}}_i\) and \(\hat{\mathbf{z}}_{i'}\), [i ≠ j] uses the Iverson bracket notation, i.e., evaluating to 1 if i ≠ j and 0 otherwise, and τ is a temperature hyperparameter (fixed at 0.1 here). The total loss is taken as the sum across all positive pairs in the batch. In our batch sampling of D_{CL}, we sample one-third of each batch from D_{s} and two-thirds from D_{u}. Conceptually, the NT-Xent loss acts to minimize the distance between embeddings of positive pairs (numerator of Eq. (1)) while maximizing the distances between embeddings of negative pairs in the batch (denominator of Eq. (1)). Consequently, we obtain representations H(x_{i}) that respect the underlying invariances of the problem.
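The NT-Xent loss described here can be sketched in a few lines of NumPy. This is an illustrative sketch, not the authors' implementation: the function name `nt_xent_loss` and the array layout are assumptions, and summing over both orderings of each positive pair follows the usual SimCLR convention, which is assumed rather than stated in the text.

```python
import numpy as np

def nt_xent_loss(z, z_prime, tau=0.1):
    """NT-Xent loss for B positive pairs (z[i], z_prime[i]).

    z, z_prime: arrays of shape (B, d) holding the projector outputs for the
    two variations g x_i and g' x_i. Assumption: the loss is summed over both
    orderings (i, i') and (i', i) of every positive pair.
    """
    B = z.shape[0]
    emb = np.concatenate([z, z_prime], axis=0)              # (2B, d)
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)  # unit-normalize
    sim = emb @ emb.T / tau                                 # s_ij / tau
    np.fill_diagonal(sim, -np.inf)                          # Iverson bracket [i != j]
    # Numerically stable log of the denominator: log sum_{j != i} exp(s_ij / tau)
    m = sim.max(axis=1, keepdims=True)
    log_denom = m[:, 0] + np.log(np.exp(sim - m).sum(axis=1))
    # Index of the positive partner of each of the 2B embeddings
    pos = np.concatenate([np.arange(B, 2 * B), np.arange(B)])
    log_num = sim[np.arange(2 * B), pos]
    return float(np.sum(log_denom - log_num))
```

Masking the diagonal with negative infinity removes the self-similarity term from the denominator, which is what the Iverson bracket expresses.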
Each epoch of contrastive learning (i.e., each full sampling of D_{CL}) is followed by a supervised learning step on the surrogate dataset D_{s}, which forms the second step of the pre-training stage (Fig. 2b), with each input from D_{s} subjected to a random invariance mapping. This supervised step shares the encoder network H with the contrastive step but additionally features a predictor network G, both updated via a task-dependent supervised training loss function (which will be separately detailed later). This step pre-conditions the weights of G and further tunes the weights of H to suit the target task.
The pre-training stage is performed for 100–400 epochs and is followed by the fine-tuning stage (Fig. 2c). This final stage uses D_{t} to fine-tune the networks H and G to the actual problem task, crucially with significantly reduced data requirements on D_{t}. Each input from D_{t} is also subjected to a random invariance mapping; the associated supervised training loss function is again problem-dependent and may even differ from that used in the pre-training stage.
In the following sections, we evaluate the effectiveness of SIBCL on two problems: predicting the density-of-states (DOS) of two-dimensional (2D) photonic crystals (PhCs) and predicting the ground-state energy of the three-dimensional (3D) non-interacting Schrödinger equation (see SI for additional experiments, including predictions of 2D PhC band structures). To investigate the effectiveness of various auxiliary information sources used in SIBCL, we benchmark our results against the following baselines:

1. Direct supervised learning (SL): randomly initialized networks H and G are trained using supervised learning on only the target dataset D_{t}. This reflects the performance of conventional supervised learning, i.e., without exploiting any auxiliary data sources.
2. Conventional transfer learning (TL): networks H and G are first pre-trained using supervised learning on the surrogate dataset D_{s} and then subsequently fine-tuned on D_{t}. This reflects the performance of including surrogate information via conventional transfer learning on a desirable, well-aligned transfer task.
3. Supervised learning with invariances (SLI): each input is subjected to a transformation randomly sampled from {g} each time before it is fed into network H and trained as per SL. This reflects the performance boost when incorporating invariance information via a standard data augmentation approach.
Finally, to critically evaluate SIBCL’s effectiveness in incorporating these auxiliary information sources, we contrast SIBCL with the combination of items 2 and 3 above, i.e., transfer learning with invariances (TLI). Notably, both SIBCL and TLI leverage an equal level of auxiliary information (both invariances and a simplified surrogate dataset) and vary only in their learning algorithm.
Data generation for 2D photonic crystals
Photonic crystals (PhCs) are wavelength-scale, periodically structured materials whose dielectric profiles are engineered to create exotic optical properties not found in bulk materials, such as photonic band gaps and negative refractive indices, with wide-ranging applications in photonics^{55,56}. Prominent among these applications is the ability of PhCs to engineer the strength of light-matter interactions^{56} or, equivalently, the density of states (DOS) of photonic modes. The DOS captures the number of modes accessible in a spectral range, i.e., the number of modes accessible to a spectrally narrow emitter, directly affecting e.g., spontaneous and stimulated emission rates. However, computing the DOS is expensive: it requires dense integration across the full Brillouin zone (BZ) of the PhC and summation over bands. Below, we demonstrate that SIBCL enables effective training of a neural network for prediction of the DOS in 2D PhCs, using only hundreds to thousands of target samples, dramatically reducing DOS-computation costs. Such neural networks could help to accelerate the design of PhC features, either directly via backpropagation^{57} or by offering a cheap evaluation for multiple invocations of the model, replacing conventional design techniques like topology optimization^{58} and inverse design^{59}.
PhCs are characterized by a periodically varying permittivity, ε(r), whose tiling makes up the PhC’s structure. For simplicity, we consider 2D square lattices of periodicity a with a “two-tone” permittivity profile, i.e., ε ∈ {ε_{1}, ε_{2}}, with ε_{i} ∈ [1, 20]. We assume lossless isotropic materials so that ε(r) and the resultant eigenfrequencies are real. We use a level set of a Fourier sum function (see Methods for details) to parameterize ε(r), discretized to result in a 32 × 32 pixel image, which forms the input to the neural network. Special care was taken in the sampling algorithm to create diverse unit cells with features of varying sizes, with examples depicted in Fig. 3a.
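We define the DOS of 2D PhCs by^{60}
\[\mathrm{DOS}(\omega)=\frac{A}{(2\pi)^2}\sum_n\int_{\rm BZ}\delta(\omega-\omega_{n\mathbf{k}})\,\mathrm{d}^2\mathbf{k}\tag{2}\]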
with ω denoting the considered frequency, ω_{nk} the PhC band structure, n the band index, k the momentum in the BZ and A = a^{2} the unit cell area. In practice, we evaluate Eq. (2) using the generalized Gilat–Raubenheimer (GGR) method^{61}, which incorporates the band group velocity extrapolatively to accelerate convergence, in an implementation adapted from ref. ^{62}. The band structure and associated group velocities are evaluated using the MIT Photonic Bands (MPB) software^{63} for the transverse magnetic (TM) polarized bands (Fig. 3b, also see Methods).
We define labels for our network using the computed DOS values (Fig. 3c) subjected to three simple post-processing steps (see Methods): (i) spectral smoothing using a narrow Gaussian kernel S_{Δ}, (ii) shifting by the DOS of the “empty lattice” (i.e., uniform lattice of index n_{avg}), \(\mathrm{DOS}_{\rm EL}(\omega)=\omega a^2 n_{\rm avg}^2/2\pi c^2\), and (iii) rescaling both the DOS and the frequency values by the natural frequency ω_{0} = 2πc/an_{avg}. More explicitly, we define the network-provided DOS labels as
\[\mathbf{y}=\omega_0\left[(S_\Delta*\mathrm{DOS})(\omega/\omega_0)-\mathrm{DOS}_{\rm EL}(\omega/\omega_0)\right]\tag{3}\]
and sample over the normalized spectral range 0 ≤ ω/ω_{0} ≤ 0.96. Step (i) accounts for the finite spectral width of physical measurements and regularizes logarithmic singularities associated with van Hove points; step (ii) counteracts the linear increase in average DOS that otherwise leads to a bias at higher frequencies, emphasizing instead the local spectral features of the DOS; and step (iii) ensures comparable input and output ranges across distinct unit cells, regardless of the cell’s average index.
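The three post-processing steps can be sketched in NumPy as follows. This is a hedged sketch under stated assumptions: the function name `dos_labels`, the uniform frequency grid, and the kernel width `delta` are hypothetical (the actual smoothing width is given in Methods, not here). It relies on the identity ω_{0}·DOS_{EL}(ω) = 2π ω/ω_{0}, which follows directly from the definitions of DOS_{EL} and ω_{0} in the text.

```python
import numpy as np

def dos_labels(dos, w, omega0, delta=0.02):
    """Turn a raw DOS spectrum into network labels (sketch).

    dos    : DOS sampled on a uniform grid of normalized frequency w = omega/omega0
    omega0 : natural frequency 2*pi*c/(a*n_avg)
    delta  : Gaussian width in units of omega0 (hypothetical value)
    """
    dw = w[1] - w[0]
    half = int(np.ceil(4 * delta / dw))            # truncate kernel at 4 sigma
    t = np.arange(-half, half + 1) * dw
    kernel = np.exp(-0.5 * (t / delta) ** 2)
    kernel /= kernel.sum()                         # unit-area discrete kernel
    smoothed = np.convolve(dos, kernel, mode="same")   # step (i)
    dos_el = 2 * np.pi * w / omega0                # empty-lattice DOS, step (ii)
    return omega0 * (smoothed - dos_el)            # rescaling, step (iii)
```

Because `mode="same"` zero-pads the spectrum, values within roughly `4 * delta` of either end of the grid are affected by edge effects in this sketch.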
In our experiments, we use 20,000 unlabeled unit cells for contrastive learning, select a target dataset of varying sizes N_{t} ∈ [50, 3000] for fine-tuning, and evaluate the prediction accuracy on a fixed test set containing 2000 samples.
For the surrogate dataset of inexpensive data, D_{s}, we created a simple dataset of 10,000 PhCs with centered circular inclusions of varying radii r ∈ (0, a/2] and inclusion and cladding permittivities sampled uniformly in ε_{i} ∈ [1, 20]. This simple class of 2D PhCs is amenable to semi-analytical treatments, e.g., Korringa–Kohn–Rostoker or multiple scattering approaches^{64,65,66,67}, that enable evaluation of the DOS at full precision with minimal computational cost. Motivated by this, we populate the surrogate dataset D_{s} with such circular inclusions and their associated (exact) DOS labels (here, we computed their labels using MPB directly since we are motivated mainly by proof-of-principle rather than concrete applications, and we had access to ample computational resources provided by MIT Supercloud).
Invariances of the PhC DOS
The considered PhCs possess no spatial symmetries beyond periodicity. Despite this, as an intrinsic, global quantity (or, equivalently, a k-integrated quantity), the DOS is setting-independent and invariant under all size-preserving transformations, that is, under all elements of the Euclidean group E(2). For simplicity’s sake, we restrict our focus to the elements of E(2) that are compatible with a pixelized unit cell (i.e., that map pixel coordinates to pixel coordinates). This subset is the direct product of the 4mm (C_{4v}) point group \(\mathcal{G}_0\) of the point lattice spanned by \(a\hat{\mathbf{x}}\) and \(a\hat{\mathbf{y}}\) and the group \(\mathcal{G}_{\mathbf{t}}\) of pixel-discrete translations. In more detail:

1. Point group symmetry (\(\mathcal{G}_0\)): 4mm includes the identity operation (1), 2- and 4-fold rotations (C_{2} and \(C_4^{\pm}\)), and horizontal, vertical, and diagonal mirrors (σ_{h}, σ_{v}, and \(\sigma_d^{(\prime)}\)), i.e., \(\mathcal{G}_0=\{1,C_2,C_4^{-},C_4^{+},\sigma_h,\sigma_v,\sigma_d,\sigma_d'\}\). Formally, this is the PhCs’ holosymmetric point group.
2. Translation symmetry (\(\mathcal{G}_{\mathbf{t}}\)): While the DOS is invariant under all continuous translations t, the pixelized unit cells are compatible only with pixel-discrete translations; i.e., we consider the (factor) group \(\mathcal{G}_{\mathbf{t}}=\{iN^{-1}a\hat{\mathbf{x}}+jN^{-1}a\hat{\mathbf{y}}\}_{i,j=0}^{N-1}\) with N = 32.
Additionally, the structure of the Maxwell equations endows the DOS with two non-Euclidean “scaling” invariances^{55}:

3. Refractive scaling (\(\mathcal{G}_{\rm s}\)): The set of (positive) amplitude-scaling transformations of the refractive index g(s)n(r) = sn(r) define a group \(\mathcal{G}_{\rm s}=\{g(s)\mid s\in\mathbb{R}_+\}\) and map the PhC eigenspectrum from ω_{nk} to s^{−1}ω_{nk}. Equivalently, g(s) maps DOS(ω) to sDOS(sω) and thus leaves the y-labels of Eq. (3) invariant under the ω_{0}-normalization.
4. Size scaling (\(\mathcal{G}_{\rm s}'\)): Analogously, the size-scaling transformations \(g'(s)\mathbf{r}=s\mathbf{r}\) define a group \(\mathcal{G}_{\rm s}'=\{g'(s)\mid s\in\mathbb{R}_+\}\), and also map ω_{nk} to s^{−1}ω_{nk} and DOS(ω) to sDOS(sω); i.e., also leaving the y-labels invariant.
Of \(\mathcal{G}_{\rm s}\) and \(\mathcal{G}_{\rm s}'\), only the amplitude scaling \(\mathcal{G}_{\rm s}\) is pixel-compatible (\(\mathcal{G}_{\rm s}'\) can be implemented as a tiling operation in the unit cell, which, however, requires down-sampling). Accordingly, we restrict our focus to the pixel-compatible invariances of the product group \(\mathcal{G}=\mathcal{G}_0\times\mathcal{G}_{\mathbf{t}}\times\mathcal{G}_{\rm s}\) and sample its elements randomly. In practice, the sampling frequency of each element in \(\mathcal{G}\) is a hyperparameter of the pre-training stage (see Methods and SI section S5).
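Sampling one element of the product group and applying it to a pixelized unit cell might look like the sketch below. The name `sample_invariance`, the uniform sampling of each factor, and the scaling range `s_range` are illustrative assumptions; the text treats the sampling frequencies as tuned hyperparameters. Refractive scaling n → sn is applied as ε → s²ε.

```python
import numpy as np

def sample_invariance(eps, rng, s_range=(0.8, 1.25)):
    """Apply a random element of G_0 x G_t x G_s to a square permittivity image.

    eps     : (N, N) array of permittivity values
    rng     : numpy.random.Generator
    s_range : hypothetical range for the refractive-index scale factor s
    """
    n = eps.shape[0]
    # G_0 (4mm): a rotation by k*90 degrees, optionally followed by a mirror,
    # enumerates all 8 point-group elements.
    out = np.rot90(eps, k=int(rng.integers(4)))
    if rng.integers(2):
        out = np.flip(out, axis=1)
    # G_t: periodic, pixel-discrete translation of the unit cell.
    dx, dy = (int(v) for v in rng.integers(n, size=2))
    out = np.roll(out, shift=(dx, dy), axis=(0, 1))
    # G_s: refractive scaling n -> s*n, i.e. eps -> s**2 * eps.
    s = rng.uniform(*s_range)
    return (s ** 2) * out
```

For example, `sample_invariance(eps, np.random.default_rng(0))` returns a new 32 × 32 image with the same two-tone geometry up to rotation, mirroring, and shift. Note that this sketch does not keep the scaled permittivities inside the sampled range [1, 20].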
Prediction of PhC DOS
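To assess the trained network’s performance in an easily interpretable setting, we define the evaluation error metric, following ref. ^{62}:
\[\mathcal{L}^{\rm eval}=\frac{\sum_{\omega/\omega_0}\left|\mathrm{DOS}_\Delta^{\rm pred}-\mathrm{DOS}_\Delta\right|}{\sum_{\omega/\omega_0}\mathrm{DOS}_\Delta}\tag{4}\]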
where \(\mathrm{DOS}_\Delta=S_\Delta*\mathrm{DOS}=\omega_0^{-1}\mathbf{y}+\mathrm{DOS}_{\rm EL}\) and \(\mathrm{DOS}_\Delta^{\rm pred}=\omega_0^{-1}\mathbf{y}^{\rm pred}+\mathrm{DOS}_{\rm EL}\) are the true and predicted S_{Δ}-smoothened DOS, respectively, and the sums are over the spectral range 0.24 ≤ ω/ω_{0} ≤ 0.96 (we omit the spectral region 0 ≤ ω/ω_{0} < 0.24 during evaluation to get a more critical metric, since the DOS has no significant features there). The network architecture and training details (loss functions, hyperparameters, layers, etc.) are discussed in the Methods section.
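A sketch of this evaluation metric, assuming it is the summed absolute error of the smoothened DOS divided by the summed true smoothened DOS over the evaluated spectral range. The function name `dos_eval_error` and the argument layout are hypothetical; the empty-lattice term again uses ω_{0}·DOS_{EL}(ω) = 2π ω/ω_{0}.

```python
import numpy as np

def dos_eval_error(y_pred, y_true, w, omega0, w_min=0.24):
    """Relative DOS error over the spectral range w >= w_min (sketch).

    y_pred, y_true : network labels on the normalized frequency grid w
    omega0         : natural frequency used for the normalization
    """
    dos_el = 2 * np.pi * w / omega0            # empty-lattice DOS
    dos_true = y_true / omega0 + dos_el        # undo the label transformation
    dos_pred = y_pred / omega0 + dos_el
    mask = w >= w_min                          # skip the featureless low range
    return np.abs(dos_pred - dos_true)[mask].sum() / dos_true[mask].sum()
```

Under this definition, a prediction that overestimates the smoothened DOS by 10% at every frequency gives an error of 0.1.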
The performance of SIBCL under this error measure is evaluated in Fig. 4 and contrasted with the performance of the baselines. In practice, to minimize the fluctuations due to sample selection, we show the mean of \(\mathcal{L}^{\rm eval}\) for three separate fine-tuning stages on distinct randomly selected datasets of size N_{t}, evaluated on a fixed test set.
A significant reduction of prediction error is observed for SIBCL over the baselines, especially for few fine-tuning samples: e.g., at N_{t} = 100, SIBCL has 4.6% error while SL, SLI and TL have 7.6, 7.1 and 6.9% error, respectively. More notably, we see a large reduction in the number of fine-tuning samples N_{t} needed to achieve the same level of prediction error, which directly illustrates the data savings in the data-scarce problem. We obtain up to 9× (7×) savings in N_{t} when compared to SL (SLI or TL) at a target prediction error of ~5.1%. These savings highlight the effectiveness of SIBCL over simple supervised learning (SL) as well as techniques leveraging a single source of auxiliary information, here represented by surrogate-based TL or invariance-augmented SL (SLI). The predicted and true DOS are compared as functions of frequency in Fig. 4b across a range of error levels as realized for different unit cell inputs. Further, SIBCL also incorporates the combined invariance and surrogate information more effectively than is achievable e.g., by incorporating the invariance information as data augmentation in surrogate-based transfer learning (invariance-augmented transfer learning, TLI). This is demonstrated by Table 1, where a steady performance advantage of SIBCL over TLI is observed.
A strong motivation for exploring deep learning methods in scientific predictive modeling is to accelerate design processes, since trained neural networks are able to offer cheap evaluations of the target problem; it is thus instructive to assess the computational savings of a trained network. For the DOS problem, inference with our trained neural network takes 0.005 s on a single Intel Xeon Gold 6148 CPU core, while the traditional numerical method takes 14.5 s for a single photonic crystal on the same hardware, resulting in a ≈3000-fold speed-up. Such savings are highly significant, particularly for design optimization applications where a huge number of forward predictions are often necessary.
To demonstrate that the effectiveness of SIBCL extends beyond the DOS prediction problem, we also trained a network using SIBCL and all baselines to predict the PhC band structure (see SI section S2). For this task, the network labels y are ω_{nk}/ω_{0}, sampled over a 25 × 25 k-grid and over the first 6 bands, i.e., \(\mathbf{y}\in\mathbb{R}_{\ge 0}^{6\times 25\times 25}\), while the inputs x remain unchanged. Unlike the DOS, the band structure is not invariant under the elements of \(\mathcal{G}_0\), but remains invariant under translations (\(\mathcal{G}_{\mathbf{t}}\)) and refractive amplitude scaling (\(\mathcal{G}_{\rm s}\)), i.e., \(\mathcal{G}=\mathcal{G}_{\mathbf{t}}\times\mathcal{G}_{\rm s}\). Also for this task, we found SIBCL to enable significant data savings, ranging up to 60× relative to the SL baseline.
Data generation for 3D Schrödinger equation
As a test of the applicability of SIBCL to higher-dimensional problems, we consider predicting the ground-state energies of the single-particle, time-independent Schrödinger equation (TISE) for random 3D potentials in a box. This problem demonstrates a proof-of-principle application to the domain of electronic structure calculations, which is of fundamental importance to the understanding of molecules and materials across physics, chemistry, and material science.
The eigenstates ψ_{n} and eigenenergies E_{n} of a single (non-relativistic) electron in a potential U(r) are the eigensolutions of the TISE:
\[\hat{H}\psi_n=E_n\psi_n\tag{5}\]
where \(\hat{H}=\hat{T}+\hat{U}\) is the Hamiltonian consisting of kinetic \(\hat{T}=-\frac{1}{2}\nabla^2\) and potential energy \(\hat{U}=U(\mathbf{r})\) contributions. Here, and going forward, we work in Hartree atomic units (h.a.u.). For simplicity, we consider random potentials U(r) confined to a cubic box of side length 10 Bohr radii (a_{0}), with values in the range [0, 1] Hartree (see Methods for details). Examples of the generated potentials are shown in Fig. 5a (left).
We associate the network input–label pairs (x, y) with the potentials U(r) (sampled over a 32 × 32 × 32 equidistant grid) and ground-state energies E_{0}, respectively. We evaluate E_{0} by using (central) finite differences with implicit Dirichlet boundary conditions to discretize Eq. (5), which is subsequently solved using an iterative sparse solver^{68}. The target dataset D_{t} is computed using a 32 × 32 × 32 finite-difference discretization, with an estimated mean numerical error ≈0.1% (Fig. 5b, left).
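The finite-difference evaluation of E_{0} can be illustrated on a small grid with a dense eigensolver. This sketch is not the solver used in the paper, which relies on an iterative sparse solver at 32 × 32 × 32; the function name `ground_state_energy` and the grid convention (N interior points per dimension, with the Dirichlet walls one grid spacing outside the first and last points) are assumptions made here.

```python
import numpy as np

def ground_state_energy(U, L=10.0):
    """Lowest eigenvalue of H = -1/2 * Laplacian + U on an N x N x N grid.

    U : (N, N, N) array of potential values in Hartree
    L : box side length in Bohr radii
    Dense construction; only practical for small N such as N = 5.
    """
    N = U.shape[0]
    h = L / (N + 1)                        # assumed grid spacing
    # 1D central second difference; psi = 0 outside the grid (Dirichlet).
    D = (2 * np.eye(N) - np.eye(N, k=1) - np.eye(N, k=-1)) / h ** 2
    I = np.eye(N)
    T = 0.5 * (np.kron(np.kron(D, I), I)
               + np.kron(np.kron(I, D), I)
               + np.kron(np.kron(I, I), D))   # kinetic energy operator
    H = T + np.diag(U.ravel())                 # C-order matches the kron layout
    return float(np.linalg.eigvalsh(H)[0])
```

For a vanishing potential the result reduces to the discrete particle-in-a-box value 3(1 − cos(π/(N + 1)))/h², which gives a quick sanity check of the discretization.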
In the previously considered PhC DOS problem, the surrogate dataset D_{s} was built from a particularly simple input class with exact and inexpensive labels. Here, instead, we assemble D_{s} by including the original range of inputs x but using approximate labels \(\tilde{y}\). In particular, we populate the surrogate dataset with input–label pairs \((\mathbf{x},\tilde{y})\), with \(\tilde{y}=\tilde{E}_0\) computed from a low-resolution finite-difference 5 × 5 × 5 discretization of U(r) (Fig. 5b). \(\tilde{E}_0\) has a relatively high error of ~10% (Fig. 5b, right) but is orders of magnitude faster to compute: e.g., a naive power iteration eigensolver requires O(n^{2}) operations per iteration (with \(n=N^3\) denoting the number of grid voxels and N the grid points per dimension), such that iterations at N = 5 require ~10^{5}-fold less work than at N = 32.
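The quoted work ratio follows from the stated O(n^{2}) per-iteration cost with n = N^{3}. Spelled out as a worked estimate (order of magnitude only, ignoring any difference in the number of iterations needed):

```latex
\[
\frac{\text{work per iteration at } N=32}{\text{work per iteration at } N=5}
= \frac{(32^{3})^{2}}{(5^{3})^{2}}
= \left(\frac{32}{5}\right)^{6}
= 6.4^{6} \approx 6.9\times 10^{4} \sim 10^{5}.
\]
```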
To assess the impact of the choice of surrogate data, we also examine an alternative surrogate dataset, with input–label pairs \((\tilde{\mathbf{x}},\tilde{y})\), derived from quantum harmonic oscillator (QHO) potentials:
where \((\mathbf{A}^{\circ n})_i=A_i^n\) is the Hadamard (element-wise) power operator. We define the associated surrogate labels by the open-boundary QHO energies, i.e., by \(\tilde{y}=\tilde{E}_0=\frac{1}{2}\sum_i\omega_i\), and assign the input \(\tilde{\mathbf{x}}\) by the in-box grid discretization of \(\tilde{U}(\mathbf{r})\). The \(\tilde{y}\) labels consequently reflect an example of analytically approximated labels (here, with approximation error due to the neglect of the Dirichlet boundary conditions); see SI section S3. For quicker training of the network, we use the 2D version of the TISE with this surrogate dataset (i.e., D_{s} and D_{t} consist of 2D QHO potentials and 2D random potentials, respectively).
Prediction of the ground-state energy of 3D Schrödinger equation
The ground-state energy is invariant under elements of the symmetry point group, i.e., \(\mathcal{G}=\mathcal{G}_0\) in 2D. In 3D, we instead have the m\(\overline{3}\)m point group, which notably has 48 elements (instead of just 8 in \(\mathcal{G}_0\)).
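The count of 48 elements can be made concrete by enumerating the grid-compatible operations on a cubic array: the 6 permutations of the three axes combined with the 8 possible sets of axis reversals. The helper name `cubic_group_transforms` is hypothetical, and this is only an illustrative enumeration, not the paper's augmentation code.

```python
import itertools
import numpy as np

def cubic_group_transforms(U):
    """Return all 48 images of a cubic array under axis permutations and flips."""
    images = []
    for perm in itertools.permutations(range(3)):            # 6 axis permutations
        V = np.transpose(U, perm)
        for flips in itertools.product([False, True], repeat=3):   # 8 sign choices
            axes = tuple(ax for ax, f in enumerate(flips) if f)
            images.append(np.flip(V, axis=axes) if axes else V)
    return images
```

For a generic (asymmetric) potential all 48 images are distinct, so each labeled sample effectively stands in for 48 equivalent inputs.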
Figure 5c shows the results using the surrogate dataset of reduced-resolution data, compared against the baselines. We observe up to 40× data savings for SIBCL when compared to SL. Additionally, consistent with our PhC experiments in Table 1, SIBCL here also outperforms invariance-augmented TL (SI section S3). As a validation step, the prediction accuracies are noted to be on the order of ≈1%, making the surrogate (target) dataset with ≈10% (≈0.1%) numerical error an appropriate design choice as approximate (target) data. For the experiments using the QHO surrogate dataset, we obtain up to 4× savings when using SIBCL compared to SL (see SI section S3); the data savings are diminished, as expected, since the QHO dataset is far simpler and contains less information to transfer.
Discussion
The widespread adoption and exploitation of data-driven techniques, most prominently deep learning via neural networks, in scientific problems has been fundamentally limited by a relative data scarcity. That is, data is only rarely available in the quantities required to train a network to faithfully reproduce the salient features of nontrivial scientific problems; moreover, even if such data can be generated, it typically requires excessive computational effort. Here, we have introduced SIBCL, a framework that overcomes these fundamental challenges by incorporating prior knowledge and auxiliary information, including problem invariances, “cheap” problem classes, and approximate labels. With SIBCL, the required quantities of costly, high-quality training data are substantially reduced, opening the application space of data-driven techniques to a broader class of scientific problems.
We demonstrated the versatility and generality of SIBCL by applying it to problems in photonics and electronic structure, namely to the prediction of the DOS and band structures of 2D PhCs and the ground state energies of the TISE. Through our experiments, we demonstrated that even very simple sources of auxiliary information can yield significant data savings. For instance, the group of invariances \({{{{{{{\mathcal{G}}}}}}}}\) can be just a set of simple rotations and mirrors as in the TISE problem. Similarly, there are diverse options for constructing the surrogate dataset: here, we explored the use of simplified structures where (semi) analytical solutions exist (e.g., circular structures of PhCs), approximate calculations of the target problem (e.g., reduced resolution computations of TISE), and even a combination of the two (e.g., approximated energies of QHO potentials in the TISE problem). Most natural science disciplines, especially physics, have deep and versatile caches of such approximate and analytical approaches which can be drawn from to create suitable surrogate datasets.
In the problems studied here, SIBCL outperformed all baselines (including invariance-augmented baselines, see SI section S4). We conjecture that SIBCL’s performance advantage stems predominantly from enforcing problem invariances via a contrastive loss, which is more effective than naive data augmentation (cf. the performance edge of SIBCL over TLI). To substantiate this hypothesis, we performed several ablation experiments (see SI section S5). Firstly, when all invariance information is disregarded in SIBCL (i.e., if the group of invariances \(\mathcal{G}\) is reduced to the trivial identity group), we observe very similar performance to TL. This demonstrates that the contrastive stage is only effective in combination with invariance information, or, equivalently, that the utility of the contrastive stage hinges strongly on attracting nontrivial positive pairs rather than merely repelling negative pairs.
Next, we explored the consequences of selectively and increasingly removing invariances from \(\mathcal{G}\). We found that including more invariances strictly improves SIBCL’s performance, consistent with expectations since the elements of \(\mathcal{G}\) are true invariances of the downstream task. This is contrary to the standard self-supervised learning (SSL) paradigm, which is task-agnostic: the downstream task is not known during the contrastive learning stage, and the transformations may not be true invariances of the downstream problem, so including more transformations can sometimes be harmful^{49,53,69}. To determine the relative and combined efficacy of SIBCL’s usage of transfer learning and contrastive learning, we performed further ablation experiments (see SI section S5). From these experiments, we establish that both the contrastive and transfer learning stages contribute significantly to the accuracy of SIBCL, with the contrastive stage being the dominant contribution. Pretraining has become a standard approach in deep learning, with TL and SSL being popular instantiations. TL relies on labeled data, leading to task specification, while SSL relies on unlabeled data and auxiliary pretext tasks to derive representations that generalize well due to their nonspecificity to any particular task. The effectiveness of SIBCL stems from combining desirable features from each of these pretraining techniques into a coherent framework. More concretely, the use of TL alone may lead to undesirable overfitting to the simplified surrogate labels; concurrently, the higher-dimensional nature of labels common in scientific applications may render SSL techniques ineffective if the transformations provided via pretext tasks do not provide sufficient context to learn the final predictive task.
The combination of unsupervised objectives and end-task objectives can lead to improved performance and data efficiency^{70}; this is further exemplified in SI section S5, showing the gains of SIBCL over its individual components.
While contrastive learning has gained enormous popularity in recent years, its techniques have mainly found applications in computer vision tasks (e.g., image classification on ImageNet^{10}) while its utility for regression problems has remained largely unexplored. Techniques like SimCLR are based on instance discrimination, i.e., the network is trained to discriminate between negative pairs in the batch. Intuitively, such techniques may seem less well-suited to regression problems where the latent space is often continuous rather than discrete or clustered as in classification problems. Indeed, we made several empirical observations that disagree with the findings of standard contrastive learning applications on classification problems. Notably, it is widely corroborated^{53,71,72} that using a larger batch size is always more beneficial, which can be interpreted as the consequence of having more negative pairs for instance discrimination. This empirical finding was not echoed in our experiments, suggesting that instance discrimination may not be well suited to regression problems. Motivated by this, we also explored the BYOL technique^{54}, which is not based on instance discrimination and does not use explicit negative pairs in its loss function (see SI section S1), but found no performance advantage. Despite many empirical successes, SSL methods remain poorly understood and lack a solid theoretical explanation^{49,73,74,75} for why and when these algorithms work well. Our work further underscores and motivates the need to develop such an improved foundation, not only to address the noted deviations from expectations but also to guide the emerging application of contrastive learning techniques to regression tasks.
Exploiting prior knowledge of symmetries and physical insights has been shown to be highly effective for deep learning in the scientific domain. For instance, architectures with handcrafted or analytically formulated components are commonly used for molecules to invoke chemically meaningful inductive biases or to respect quantum-mechanical properties^{30,31,33,76}. There also exists a growing body of work on equivariant networks for various symmetry groups^{35,36,37,77}, particularly for applications in the natural sciences^{40,78}, to which our work is highly complementary. These works are mainly motivated by the fact that the exploitation of symmetry or physical insights provides a strong inductive bias, which constrains the space of possible models or allows the model to properly capture physical limits, ultimately achieving better predictive accuracy and higher data efficiency. Like these networks, SIBCL also aims to create a network that exploits underlying symmetries and known physical invariances of the problem. However, rather than hard-coding invariance information into the model architecture, the process is achieved organically via contrastive learning. The price paid for this more generic approach is that feature invariance to the symmetry group \(\mathcal{G}\) is only approximately achieved (to a degree expressed indirectly by the NT-Xent loss, Eq. (1)) rather than exactly, as is the case for hard-coded problem-specific architectures. Conversely, SIBCL has the advantage of being simple and readily generalizable to any known invariance, i.e., it requires no specialized kernels or mathematical construction, and can readily incorporate additional invariances without changes to the underlying architecture.
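To make concrete how a contrastive loss encourages (rather than enforces) invariance, the following sketch evaluates an NT-Xent loss in its standard SimCLR form on plain Python lists. It is an illustration under stated assumptions: the temperature value is an arbitrary choice, the function name is hypothetical, and `z_a[i]` and `z_b[i]` are assumed to be the embeddings of two symmetry-transformed copies of the same input.

```python
import math

def nt_xent(z_a, z_b, temperature=0.1):
    """NT-Xent loss for N positive pairs (z_a[i], z_b[i]); all other samples are negatives."""
    def normalize(v):
        norm = math.sqrt(sum(c * c for c in v))
        return [c / norm for c in v]

    z = [normalize(v) for v in list(z_a) + list(z_b)]
    n = len(z_a)
    total = 0.0
    for i in range(2 * n):
        j = (i + n) % (2 * n)  # index of the positive partner of sample i
        # Temperature-scaled cosine similarities between sample i and every sample
        logits = [sum(p * q for p, q in zip(z[i], z[k])) / temperature
                  for k in range(2 * n)]
        # Log-sum-exp over all k != i, shifted by the maximum for numerical stability
        others = [logits[k] for k in range(2 * n) if k != i]
        m = max(others)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in others))
        total += log_denominator - logits[j]
    return total / (2 * n)
```

The loss approaches zero only when each positive pair is far more similar than every negative pair, which is why invariance of the features is attained approximately and to a degree set by how well the loss is minimized.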
Given the ubiquity and impact of symmetry-preserving equivariant architectures in scientific deep learning, we compared SIBCL against one prominent architecture in this domain, namely the E(2)-equivariant CNNs proposed by Weiler and Cesa^{79} (see SI section S7). Our experiments show that SIBCL remains competitive with, and even outperforms, such equivariant architectures. Relatedly, SIBCL’s superior performance over TLI (Table 1) similarly suggests that using contrastive learning to enforce invariances is likely to be more effective than naive data augmentation.
Our work provides insights on how issues of data scarcity can be overcome by leveraging sources of auxiliary information in natural science problems. The SIBCL framework presented in this work demonstrates how such auxiliary information can be readily and generically incorporated in the network training process. Our work also provides insights on the thus-far less-explored application of contrastive learning for regression tasks, opening up opportunities for applications in several domains dominated by regression problems, in particular, the natural sciences. Finally, we note that SIBCL was developed with the motivation that many problems in the natural sciences are endowed with approximate or analytical approaches that can be used to create a surrogate dataset at low computational cost; its applicability is thus limited to such domains.
Methods
PhC unit cells and DOS processing
We parameterize ε(r) by choosing a level set of a Fourier sum function ϕ, defined as a linear sum of plane waves with frequencies evenly spaced in reciprocal space (up to some cutoff), i.e.,
where each n_{k} is a 2D vector (n_{x}, n_{y}) and we used 3 Fourier components per dimension, i.e., n_{x}, n_{y} ∈ {−1, 0, 1} (and thus the summation index k runs over 9 terms). c_{k} is a complex coefficient, c_{k} = re^{iθ}, with r and θ separately sampled uniformly in [0, 1). Finally, we uniformly sample a filling fraction, defined as the fraction of area in the unit cell occupied by ε_{1}, in [0, 1) to determine the level set Δ so as to obtain the permittivity profile:
This procedure produces periodic unit cells whose features vary uniformly in size (owing to the uniform sampling of the filling fraction) and lack strongly divergent feature scales, thus corresponding to fabricable designs.
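The sampling procedure can be sketched in a few lines of standard-library Python. Several details below are assumptions made for illustration, not statements about the implementation used here: ϕ is taken as the real part of Σ_k c_k exp(2πi n_k·r), the region ϕ ≤ Δ is assigned ε_1, and the two permittivity values are placeholders. The function name `random_unit_cell` is hypothetical.

```python
import cmath
import math
import random

def random_unit_cell(n=32, eps1=1.0, eps2=10.0, seed=0):
    """Sample a two-tone periodic unit cell from a random Fourier level set.

    Assumptions (not from the text): phi(r) = Re sum_k c_k exp(2*pi*i n_k.r),
    pixels with phi <= level receive eps1, and eps1/eps2 are placeholder values.
    """
    rng = random.Random(seed)
    waves = [(nx, ny) for nx in (-1, 0, 1) for ny in (-1, 0, 1)]  # 9 plane waves
    # c_k = r * exp(i * theta), with r and theta each uniform in [0, 1)
    coeffs = [rng.random() * cmath.exp(1j * rng.random()) for _ in waves]
    grid = [j / n for j in range(n)]
    phi = [
        [sum(c * cmath.exp(2j * math.pi * (nx * x + ny * y))
             for c, (nx, ny) in zip(coeffs, waves)).real
         for y in grid]
        for x in grid
    ]
    fill = rng.random()  # target area fraction occupied by eps1
    values = sorted(v for row in phi for v in row)
    n_fill = round(fill * len(values))
    # Level chosen so that n_fill pixels satisfy phi <= level
    level = values[n_fill - 1] if n_fill > 0 else -math.inf
    eps = [[eps1 if v <= level else eps2 for v in row] for row in phi]
    return eps, fill
```

Because the plane-wave indices are integers, ϕ is periodic over the unit cell, and choosing the level from the sorted values of ϕ reproduces the sampled filling fraction up to the grid resolution.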
With the MIT Photonic Bands (MPB) software^{63}, we used 25 × 25 plane waves (and also a 25 × 25 k-point resolution) over the Brillouin zone −π/a < k_{x,y} ≤ π/a to compute the band structure of each unit cell up to the first 10 bands and also extracted the group velocities at each k-point. We then computed the DOS for ω/ω_{0} ∈ [0, 0.96] over 16000 equidistantly spaced frequency samples using the generalized Gilat–Raubenheimer (GGR) method^{61,62}. Next, we computed the S_{Δ}-smoothened DOS, i.e., \(\mathrm{DOS}_{\Delta}=S_{\Delta} * \mathrm{DOS}\), using a Gaussian filter \(S_{\Delta}(\omega)=\mathrm{e}^{-\omega^{2}/2\Delta^{2}}/(\sqrt{2\pi}\,\Delta)\) of spectral width Δ = 0.006ω_{0}. Before defining the associated network labels y, we downsampled DOS_{Δ} to 400 frequency points. Finally, the network y-labels are constructed according to Eq. (3), i.e., by subtracting the background “empty-lattice” DOS, \(\mathrm{DOS}_{\mathrm{EL}}(\omega)=a^{2}n_{\mathrm{avg}}^{2}\omega/2\pi c^{2}\) (the DOS of a uniform unit cell Ω of index \(n_{\mathrm{avg}}=\frac{1}{|\Omega|}\int_{\Omega}n(\mathbf{r})\,\mathrm{d}^{2}\mathbf{r}\)), and rescaling by ω_{0}. Subtracting DOS_{EL} removes a high-frequency bias during training and was found to improve overall network accuracy.
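The smoothing, downsampling, and background terms described above can be illustrated with a short sketch. It is not the processing code used in this work: the truncation of the Gaussian at four standard deviations, the renormalization of the kernel near the spectrum edges, and both function names are assumptions.

```python
import math

def smooth_and_downsample(dos, d_omega, width, n_out):
    """Gaussian-smooth a uniformly sampled spectrum, then keep n_out evenly spaced points."""
    half = int(math.ceil(4 * width / d_omega))  # truncate kernel at 4 standard deviations
    kernel = [math.exp(-((k * d_omega) ** 2) / (2 * width ** 2))
              for k in range(-half, half + 1)]
    n = len(dos)
    smoothed = []
    for i in range(n):
        acc = 0.0
        weight = 0.0
        for k in range(-half, half + 1):
            j = i + k
            if 0 <= j < n:
                acc += kernel[k + half] * dos[j]
                weight += kernel[k + half]
        smoothed.append(acc / weight)  # renormalize where the kernel is cut by an edge
    stride = max(1, n // n_out)
    return smoothed[::stride][:n_out]

def empty_lattice_dos(omega, a, n_avg, c=1.0):
    """Empty-lattice DOS of a uniform 2D cell: a^2 * n_avg^2 * omega / (2 * pi * c^2)."""
    return a ** 2 * n_avg ** 2 * omega / (2 * math.pi * c ** 2)
```

With the values quoted in the text (16000 samples on [0, 0.96] and Δ = 0.006ω_{0}), the kernel standard deviation spans 100 samples and the downsampling stride is 40.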
3D TISE unit cells
To generate samples of U(r), we follow the same procedure as in Eqs. (7) and (8) to first create two-tone potential profiles in 3D, i.e., r = (x, y, z) and n_{k} = (n_{x}, n_{y}, n_{z}) are now 3D vectors. We create finer features by increasing the number of Fourier components to n_{x}, n_{y}, n_{z} ∈ {−2, −1, 0, 1, 2} (and hence the summation in Eq. (7) now runs over 125 terms). We also modify the range of the potential, i.e., ε_{1} in Eq. (8) is set to 0, while ε_{2} is uniformly sampled in [0, 1]. The periodicity is removed by truncating 20% of the unit cell from each edge. A Gaussian filter with a kernel size of 8% of the (new) unit cell is then applied to smooth the potential profile and, finally, the unit cells are discretized to a resolution of 32 × 32 × 32. This procedure is illustrated in SI section S3 and is similarly used to produce the 2D unit cells, discretized to 32 × 32, when using the QHO surrogate dataset. The ratio between the length scale and the potentials’ dynamic range was also carefully selected to produce nontrivial wavefunctions, so as to create a meaningful deep learning problem (see SI section S3 for further discussion).
Model architecture
Our encoder network H consists of 3 to 4 convolutional neural network (CNN) layers followed by 2 fully connected (FC) layers, where the output of the CNN layers is flattened before being fed into the FC layers. The channel dimensions in the CNN layers and the number of nodes in the FC layers vary for the different problems, and are listed in Table 2. For TISE, the CNN layers have 3D kernels to cater for the 3D inputs, while the CNNs for the remaining problems use regular 2D kernels, as in standard image tasks. For the predictor network G, we used 4 FC layers for all the problems, with the number of nodes listed in Table 3. The predictor network for the band structure problem consists of 6 blocks of the same layer architecture, each block leading to one of the 6 bands and separately updated using the loss from that band during training. A similar architecture was used in previous work^{5}. We included BatchNorm^{80}, ReLU^{81} activations, and MaxPooling between the CNN layers, and ReLU activations between all the FC layers in H and G. For the projector network J, we used 2 FC layers with hidden dimension 1024 and a ReLU activation between them; the final metric embeddings have dimension 256. J is fixed across all problems. Using the DOS prediction problem, we also experimented with deeper projector networks (i.e., increasing to 4 FC layers with the same hidden dimensions), as well as including BatchNorm between the layers, and found small improvements.
Invariance sampling during contrastive learning
In conventional contrastive learning applications in computer vision (CV), different instances of the input are often created via a pre-optimized, sequential application of various data augmentation strategies such as random cropping, color distortion, and Gaussian blur^{53,54}. Adopting this technique, we also apply transformations from each subgroup of \(\mathcal{G}\) in the randomly determined order \([\mathcal{G}_{\mathbf{t}},\mathcal{G}_{0},\mathcal{G}_{\mathrm{s}}]\) and, additionally, experimented with various algorithms for performing contrastive learning; see SI section S5. We find that introducing stochasticity in transformation application is an effective strategy and thus use it in SIBCL. More specifically, for each subgroup \(\mathcal{G}_{\alpha}\), with α ∈ {0, t, s}, we set a probability p_{α} with which any non-identity transformation is applied. (Equivalently, inputs are not transformed with probability 1 − p_{α}.) {p_{α}} is a set of hyperparameters that are often intricately optimized in standard CV applications (among other hyperparameters such as the order and strength of augmentations); here, for simplicity, we omitted this optimization step. We set p_{α} = 0.5 for all α, and sampled the elements uniformly, i.e., each transformation in \(\mathcal{G}_{\alpha}\) is applied with probability 0.5/m_{α}, with m_{α} being the total number of non-identity elements in \(\mathcal{G}_{\alpha}\).
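The stochastic sampling rule can be written compactly as follows. This is a sketch of the rule as described in the text; the function name and the representation of each subgroup as a list of labelled non-identity elements are hypothetical choices.

```python
import random

def sample_transformations(subgroups, p=0.5, rng=random):
    """Pick at most one non-identity element per subgroup, in the given order.

    `subgroups` is an ordered list of (name, non_identity_elements) pairs.
    With probability p a subgroup contributes one uniformly chosen element,
    so each element is selected with probability p / m, m = len(elements).
    """
    chosen = []
    for name, elements in subgroups:
        if elements and rng.random() < p:
            chosen.append((name, rng.choice(elements)))
    return chosen
```

For example, calling the function with a list ordered as [G_t, G_0, G_s] returns the transformations to compose for one view of an input; calling it twice on the same input yields the two views of a positive pair.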
PhC DOS prediction loss functions
In step b of the pretraining stage where we trained using supervised learning loss on D_{s} (Fig. 2b), we used the pretraining loss function
for each sample in the batch, where y^{pred} and y are the network prediction and the true label of that sample, respectively, and ∣ ⋅ ∣ gives the element-wise absolute value. We take the mean over the (normalized) frequency axis (ω/ω_{0}) to get a scalar for \(\mathcal{L}^{PT}\). This loss function was used during pretraining (for SIBCL and the TL baselines); its purpose is to encourage the network to learn from the surrogate dataset the general features in the DOS spectrum and underemphasize the loss at places where the DOS diverges, i.e., at the Van Hove singularities. In our experiments, we found that \(\mathcal{L}^{PT}\) indeed gave better prediction accuracies than the standard L1 or mean squared error (MSE) loss functions. After the pretraining step, the standard L1 loss function was used during fine-tuning on D_{t} (Fig. 2c) for SIBCL and all the baselines.
PhC band structure prediction loss functions
During supervised training (for both pretraining and fine-tuning), we use the MSE loss function; for evaluation, we use a relative error measure (for easier interpretation) given by
where ω_{n}(k) are the eigenfrequencies indexed over band numbers n = 1, 2, ..., 6 and k are the wave vectors restricted to the Brillouin zone, i.e., −π/a < k_{x,y} ≤ π/a. The evaluation loss is taken as the mean over all 6 bands and over all (25 × 25) k-points.
Ground-state energy prediction loss functions
The MSE loss function is similarly used during both the pretraining and fine-tuning stages of supervised training of the ground-state energy prediction problem. During evaluation, we use a simple relative error measure,
where y^{pred} is the network prediction and y = E_{0} is the recorded ground-state energy, for each sample in the test set.
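A relative error of this kind is straightforward to evaluate over a test set. The sketch below assumes the common definition |y^{pred} − y| / |y| averaged over samples; this form is an assumption for illustration and may differ in detail from the measure defined above, and the function name is hypothetical.

```python
def mean_relative_error(predictions, targets):
    """Mean of |y_pred - y| / |y| over a test set of scalar energies (assumed definition)."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    return sum(abs(p - t) / abs(t) for p, t in zip(predictions, targets)) / len(targets)
```

A relative measure is convenient for evaluation because it is directly comparable with the quoted numerical errors of the surrogate and target datasets.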
Training hyperparameters
For training the networks in all problems, we used Adam optimizers^{82}, with learning rates for the different steps specified in Table 4. We also used an adaptive learning rate scheduler for the fine-tuning stage. Even though standard contrastive learning methods implement a cosine annealing scheduler^{83}, we found that this was not beneficial for SIBCL on our problems and thus omitted it. Additionally, in order to prevent networks H and G from overfitting to the surrogate dataset, we explored various conventional regularization techniques during the pretraining stage, such as weight decay and dropout. We found that these were not beneficial; instead, we used early stopping, where we saved the pretrained model at various epochs and performed the fine-tuning stage on all of them, picking only the best results to use as the final performance. For SIBCL, the pretrained model was saved at {100, 200, 400} epochs, and for TL (both with and without invariances), the pretrained model was saved at {40, 100, 200} epochs (see SI section S6 for more details on the checkpoint choices). Finally, another important hyperparameter in our experiments is the kernel size (n_{k}) of the CNN layers; apart from optimizing the learning process, this hyperparameter can be used to adjust the network size. This is important in our experiments since we are training/fine-tuning on varying sizes N_{t} of the target dataset; a smaller (bigger) dataset is likely to need a smaller (bigger) network for optimal results. For the DOS prediction, we varied n_{k} ∈ {5, 7}; for band structures, n_{k} ∈ {7, 9, 11}; and for TISE, n_{k} ∈ {5, 7}. The same set of n_{k} was applied for both SIBCL and all baselines in every problem. Apart from those mentioned here, SIBCL involves many other hyperparameters that were not explored; see additional comments in SI section S6.
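The checkpoint and kernel-size selection described above amounts to a small grid search. The following sketch shows one way to organize it; `finetune_and_evaluate` is a hypothetical user-supplied callable (not part of any library) that fine-tunes from the checkpoint saved at a given pretraining epoch, with a given kernel size, and returns the resulting evaluation error.

```python
from itertools import product

def select_best(checkpoint_epochs, kernel_sizes, finetune_and_evaluate):
    """Fine-tune from every (checkpoint, kernel size) pair and keep the lowest error.

    `finetune_and_evaluate(epoch, kernel_size)` is a hypothetical callable
    returning the evaluation error after fine-tuning.
    """
    results = {
        (epoch, k): finetune_and_evaluate(epoch, k)
        for epoch, k in product(checkpoint_epochs, kernel_sizes)
    }
    best = min(results, key=results.get)
    return best, results[best]
```

For the SIBCL DOS experiments, the arguments would be the checkpoint epochs [100, 200, 400] and kernel sizes [5, 7], giving six fine-tuning runs per target dataset size.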
Data availability
PhC band structures were computed using MPB^{63}. DOS calculations were carried out using the GGR method, adapted from the MATLAB implementation in ref. ^{62}. Numerical solution of the TISE ground-state energies was implemented in Python using SciPy^{84}. The datasets generated in this study and the source code used to generate them are available via the code repository at https://github.com/clott3/SIBCL^{85}.
Code availability
The neural networks were implemented and trained using the PyTorch framework^{86}. All source codes used for training, for both SIBCL and the baselines presented in this work, are publicly available at https://github.com/clott3/SIBCL^{85}.
References
Agrawal, A. & Choudhary, A. Deep materials informatics: Applications of deep learning in materials science. MRS Commun. 9, 779–792 (2019).
Mater, A. C. & Coote, M. L. Deep learning in chemistry. J. Chem. Inf. Modeling 59, 2545–2559 (2019).
Tanaka, A., Tomiya, A. & Hashimoto, K. Deep Learning and Physics (Springer Nature, 2021).
Senior, A. W. et al. Improved protein structure prediction using potentials from deep learning. Nature 577, 706–710 (2020).
Christensen, T. et al. Predictive and generative machine learning models for photonic crystals. Nanophotonics 9, 4183–4192 (2020).
Kim, S. et al. Scalable and flexible deep Bayesian optimization with auxiliary information for scientific problems. Preprint at http://arxiv.org/abs/2104.11667 (2021).
Ahn, S., Kim, J., Lee, H. & Shin, J. Guiding deep molecular optimization with genetic exploration. In Advances in Neural Information Processing Systems 33, 12008–12021 (2020).
Jha, D. et al. ElemNet : Deep learning the chemistry of materials from only elemental composition. Sci. Rep. 8, 17593 (2018).
Lu, P. Y., Kim, S. & Soljačić, M. Extracting interpretable physical parameters from spatiotemporal systems using unsupervised learning. Phys. Rev. X 10, 031056 (2020).
Deng, J. et al. ImageNet: A largescale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition 248–255 (2009).
Krizhevsky, A., Nair, V. & Hinton, G. CIFAR-10 (Canadian Institute for Advanced Research) http://www.cs.toronto.edu/~kriz/cifar.html (2009).
Miller, G. A. WordNet: A lexical database for English. Commun. ACM 38, 39–41 (1995).
Hutchinson, M. L. et al. Overcoming data scarcity with transfer learning. Preprint at http://arxiv.org/abs/1711.05099 (2017).
Tan, C. et al. A survey on deep transfer learning. In Artificial Neural Networks and Machine Learning–ICANN, 270–279 (2018).
Pan, S. J. & Yang, Q. A survey on transfer learning. IEEE Trans. Knowl. Data Eng. 22, 1345–1359 (2010).
Glorot, X., Bordes, A. & Bengio, Y. Domain adaptation for largescale sentiment classification: A deep learning approach. In Proceedings of the 28th International Conference on International Conference on Machine Learning 513–520 (2011).
Bengio, Y. Deep learning of representations for unsupervised and transfer learning. In Proceedings of ICML Workshop on Unsupervised and Transfer Learning 17–36 (2012).
Qu, Y., Jing, L., Shen, Y., Qiu, M. & Soljačić, M. Migrating knowledge between physical scenarios based on artificial neural networks. ACS Photonics 6, 1168–1174 (2019).
Yamada, H. et al. Predicting materials properties with little data using shotgun transfer learning. ACS Cent. Sci. 5, 1717–1730 (2019).
Lubbers, N., Lookman, T. & Barros, K. Inferring lowdimensional microstructure representations using convolutional neural networks. Phys. Rev. E 96, 052111 (2017).
Li, X. et al. A transfer learning approach for microstructure reconstruction and structureproperty predictions. Sci. Rep. 8, 13461 (2018).
Rosenstein, M. T., Marx, Z., Kaelbling, L. P. & Dietterich, T. G. To transfer or not to transfer. In NIPS’05 Workshop, Inductive Transfer: 10 Years Later (2005).
Yosinski, J., Clune, J., Bengio, Y. & Lipson, H. How transferable are features in deep neural networks? In Advances in Neural Information Processing Systems 27 (2014).
Zhang, Y. & Ling, C. A strategy to apply machine learning to small datasets in materials science. npj Comput. Mater. 4, 1–8 (2018).
Jing, L. & Tian, Y. Selfsupervised visual feature learning with deep neural networks: A survey. Preprint at http://arxiv.org/abs/1902.06162 (2019).
Jaiswal, A., Babu, A. R., Zadeh, M. Z., Banerjee, D., & Makedon, F. A survey on contrastive selfsupervised learning. Technologies 9, 2 (2021).
Gidaris, S., Singh, P. & Komodakis, N. Unsupervised representation learning by predicting image rotations. In International Conference on Learning Representations (2018).
Noroozi, M. & Favaro, P. Unsupervised learning of visual representations by solving jigsaw puzzles. ECCV (2016).
Behler, J. Atomcentered symmetry functions for constructing highdimensional neural network potentials. J. Chem. Phys. https://doi.org/10.1063/1.3553717 (2011).
Schütt, K. T., Arbabzadah, F., Chmiela, S., Müller, K. R., & Tkatchenko, A. Quantumchemical insights from deep tensor neural networks. Nat. Commun. https://doi.org/10.1038/ncomms13890 (2017).
Schütt, K. T., Sauceda, H. E., Kindermans, P.J., Tkatchenko, A. & Müller, K.R. Schnet—a deep learning architecture for molecules and materials. J. Chem. Phys. 148, 241722 (2018).
Schütt, K. T., Unke, O. T., & Gastegger, M. Equivariant message passing for the prediction of tensorial properties and molecular spectra. In Proceedings of the 38th International Conference on Machine Learning, PMLR 139 (2021).
Unke, O. T. et al. SpookyNet: Learning force fields with electronic degrees of freedom and nonlocal effects. Nat. Commun. https://doi.org/10.1038/s41467-021-27504-0 (2021).
Ling, J., Jones, R. & Templeton, J. Machine learning strategies for systems with invariance properties. J. Comput. Phys. https://doi.org/10.1016/j.jcp.2016.05.003 (2016).
Cohen, T. S. & Welling, M. Group equivariant convolutional networks. In Proceedings of the 33rd International Conference on Machine Learning, PMLR 48, 2990–2999 (2016).
Thomas, N. et al. Tensor field networks: Rotation and translationequivariant neural networks for 3D point clouds. Preprint at http://arxiv.org/abs/1802.08219 (2018).
Weiler, M., Geiger, M., Welling, M., Boomsma, W. & Cohen, T. 3D steerable CNNs: Learning rotationally equivariant features in volumetric data. In Advances in Neural Information Processing Systems 32 (2018).
Cohen, T. S., Geiger, M., Köhler, J., & Welling, M. Spherical CNNs. In Proceedings of the 6th International Conference on Learning Representations ICLR (2018).
Simm, G. N. C., Pinsler, R., Csányi, G., & HernándezLobato, J. M. Symmetryaware actorcritic for 3d molecular design. In International Conference on Learning Representations (2021).
Batzner, S. et al. E(3)-equivariant graph neural networks for data-efficient and accurate interatomic potentials. Nat. Commun. 13, 2453 (2022).
Mikołajczyk, A. & Grochowski, M. Data augmentation for improving deep learning in image classification problem. In 2018 International Interdisciplinary PhD Workshop (IIPhDW) 117–122 (2018).
Shorten, C. & Khoshgoftaar, T. M. A survey on Image Data Augmentation for Deep Learning. J. Big Data 6, 60 (2019).
CortesCiriano, I. & Bender, A. Improved chemical structureactivity modeling through data augmentation. J. Chem. Inf. Modeling 55, 2682–2692 (2015).
Oviedo, F. et al. Fast and interpretable classification of small X-ray diffraction datasets using data augmentation and deep neural networks. npj Comput. Mater. https://doi.org/10.1038/s41524-019-0196-x (2019).
Wang, Y., Wang, J., Cao, Z. & Farimani, A. B. MolCLR: Molecular contrastive learning of representations via graph neural networks. Nat. Mach. Intell. 4, 279–287 (2022).
Wetzel, S. J., Melko, R. G., Scott, J., Panju, M. & Ganesh, V. Discovering symmetry invariants and conserved quantities by interpreting siamese neural networks. Phys. Rev. Res. 2, 033499 (2020).
You, Y. et al. Graph contrastive learning with augmentations. In Advances in Neural Information Processing Systems 33 (2020).
Kiyasseh, D., Zhu, T. & Clifton, D. A. CLOCS: contrastive learning of cardiac signals. Preprint at https://arxiv.org/abs/2005.13249 (2020).
Tian, Y. et al. What makes for good views for contrastive learning? In Advances in Neural Information Processing Systems 33 (2020).
Johnson, E. R. & Becke, A. D. A postHartree–Fock model of intermolecular interactions. J. Chem. Phys. 123, 024101 (2005).
González, M. A. Force fields and molecular dynamics simulations. EDP Sci. 12, 169–200 (2011).
Perdew, J. P. & Schmidt, K. Jacob’s ladder of density functional approximations for the exchangecorrelation energy. AIP Conf. Proc. 577, 1–20 (2001).
Chen, T., Kornblith, S., Norouzi, M. & Hinton, G. A simple framework for contrastive learning of visual representations. In Proceedings of the 37th International Conference on Machine Learning PMLR 119 (2020).
Grill, J.B. et al. Bootstrap your own latent: A new approach to selfsupervised learning. In Advances in Neural Information Processing Systems 33 (2020).
Joannopoulos, J. D., Johnson, S. G., Winn, J. N. & Meade, R. D. Photonic Crystals: Molding the Flow of Light 2nd edn (Princeton University Press, 2008).
Yablonovitch, E. Inhibited spontaneous emission in solidstate physics and electronics. Phys. Rev. Lett. 58, 2059–2062 (1987).
Peurifoy, J. et al. Nanophotonic particle simulation and inverse design using artificial neural networks. Sci. Adv. 4, eaar4206 (2018).
Jensen, J. S. & Sigmund, O. Topology optimization for nanophotonics. Laser Photonics Rev. 5, 308–321 (2011).
Molesky, S. et al. Inverse design in nanophotonics. Nat. Photonics 12, 659–670 (2018).
Novotny, L. & Hecht, B. Principles of NanoOptics (Cambridge University Press, 2012).
Gilat, G. & Raubenheimer, L. J. Accurate numerical method for calculating frequencydistribution functions in solids. Phys. Rev. 144, 390 (1966).
Liu, B., Johnson, S. G., Joannopoulos, J. D. & Lu, L. Generalized Gilat–Raubenheimer method for densityofstates calculation in photonic crystals. J. Opt. 20, 044005 (2018).
Johnson, S. G. & Joannopoulos, J. D. Blockiterative frequencydomain methods for Maxwell’s equations in a planewave basis. Opt. Express 8, 173–190 (2001).
Ohtaka, K. Energy band of photons and lowenergy photon diffraction. Phys. Rev. B 19, 5057–5067 (1979).
Wang, X., Zhang, X.G., Yu, Q. & Harmon, B. N. Multiplescattering theory for electromagnetic waves. Phys. Rev. B 47, 4161–4167 (1993).
Moroz, A. Densityofstates calculations and multiplescattering theory for photons. Phys. Rev. B 51, 2068–2081 (1995).
Moroz, A. Metallodielectric diamond and zincblende photonic crystals. Phys. Rev. B 66, 115109 (2002).
Lehoucq, R. B., Sorensen, D. C. & Yang, C. Arpack users guide: Solution of large scale eigenvalue problems by implicitly restarted Arnoldi methods, https://doi.org/10.1137/1.9780898719628.ch1 (1997).
Xiao, T., Wang, X., Efros, A. A. & Darrell, T. What should not be contrastive in contrastive learning. In International Conference on Learning Representations (2021).
Dery, L. M., Michel, P., Talwalkar, A. & Neubig, G. Should we be pre-training? An argument for end-task aware training as an alternative. In International Conference on Learning Representations (2022).
Oord, A. v., Li, Y. & Vinyals, O. Representation learning with contrastive predictive coding. Preprint at http://arxiv.org/abs/1807.03748 (2019).
Tian, Y., Krishnan, D. & Isola, P. Contrastive multiview coding. ECCV (2020).
Wang, T. & Isola, P. Understanding contrastive representation learning through alignment and uniformity on the hypersphere. In Proceedings of the 37th International Conference on Machine Learning PMLR 119 (2020).
Arora, S., Khandeparkar, H., Khodak, M., Plevrakis, O. & Saunshi, N. A theoretical analysis of contrastive unsupervised representation learning. In Proceedings of the 36th International Conference on Machine Learning PMLR 97 (2019).
Wu, M., Zhuang, C., Mosse, M., Yamins, D. & Goodman, N. On mutual information in contrastive learning for visual representations. Preprint at http://arxiv.org/abs/2005.13149 (2020).
Chmiela, S., Sauceda, H. E., Müller, K.-R. & Tkatchenko, A. Towards exact molecular dynamics simulations with machine-learned force fields. Nat. Commun. 9, 3887 (2018).
Fuchs, F. B., Worrall, D. E., Fischer, V. & Welling, M. SE(3)-transformers: 3D roto-translation equivariant attention networks. In Advances in Neural Information Processing Systems (2020).
Chen, Z. et al. Direct prediction of phonon density of states with Euclidean neural networks. Adv. Sci. 8, 2004214 (2021).
Weiler, M. & Cesa, G. General E(2)-equivariant steerable CNNs. In Advances in Neural Information Processing Systems 32 (2019).
Ioffe, S. & Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on Machine Learning JMLR 37 (2015).
Nair, V. & Hinton, G. E. Rectified linear units improve restricted Boltzmann machines. In ICML 807–814 (2010).
Kingma, D. P. & Ba, J. Adam: A method for stochastic optimization. In International Conference on Learning Representations (2015).
Loshchilov, I. & Hutter, F. SGDR: Stochastic gradient descent with warm restarts. In International Conference on Learning Representations (2017).
Virtanen, P. et al. SciPy 1.0: Fundamental algorithms for scientific computing in Python. Nat. Methods 17, 261–272 (2020).
Loh, C. & Christensen, T. Sibcl (v1.0) [code repository]. Zenodo https://doi.org/10.5281/zenodo.6710073 (2022).
Paszke, A. et al. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, 8024–8035 (2019).
Acknowledgements
We thank Peter Lu, Andrew Ma, Ileana Rugina, Hugo Larochelle, and Li Jing for fruitful discussions. We acknowledge the MIT SuperCloud and Lincoln Laboratory Supercomputing Center for providing HPC resources that have contributed to the research results reported here. This work was sponsored in part by the United States Air Force Research Laboratory and the United States Air Force Artificial Intelligence Accelerator and was accomplished under Cooperative Agreement Number FA87501921000. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the United States Air Force or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation herein. This work was also sponsored in part by the National Science Foundation under Cooperative Agreement PHY2019786 (The NSF AI Institute for Artificial Intelligence and Fundamental Interactions, http://iaifi.org/) and in part by the US Office of Naval Research (ONR) Multidisciplinary University Research Initiative (MURI) grant N000142012325 on Robust Photonic Materials with High-Order Topological Protection. This material is also based upon work supported in part by the U.S. Army Research Office through the Institute for Soldier Nanotechnologies at MIT, under Collaborative Agreement Number W911NF1820048, and upon work supported by the Air Force Office of Scientific Research under award number FA95502110317. C.L. acknowledges financial support from the DSO National Laboratories, Singapore. S.K. acknowledges support from the Department of Defense through the National Defense Science and Engineering Graduate Fellowship Program.
Author information
Contributions
C.L., T.C., and M.S. conceived the research. C.L. designed and implemented the framework and performed the experiments. T.C., R.D., and S.K. contributed to the development of the framework. T.C. and M.S. supervised the research. C.L. and T.C. wrote the manuscript with input from all authors.
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Nature Communications thanks the anonymous reviewer(s) for their contribution to the peer review of this work.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Loh, C., Christensen, T., Dangovski, R. et al. Surrogate- and invariance-boosted contrastive learning for data-scarce applications in science. Nat Commun 13, 4223 (2022). https://doi.org/10.1038/s41467-022-31915-y
DOI: https://doi.org/10.1038/s41467-022-31915-y