Correlator Convolutional Neural Networks: An Interpretable Architecture for Image-like Quantum Matter Data

Machine learning models are a powerful theoretical tool for analyzing data from quantum simulators, in which results of experiments are sets of snapshots of many-body states. Recently, they have been successfully applied to distinguish between snapshots that can not be identified using traditional one and two point correlation functions. Thus far, the complexity of these models has inhibited new physical insights from this approach. Here, using a novel set of nonlinearities we develop a network architecture that discovers features in the data which are directly interpretable in terms of physical observables. In particular, our network can be understood as uncovering high-order correlators which significantly differ between the data studied. We demonstrate this new architecture on sets of simulated snapshots produced by two candidate theories approximating the doped Fermi-Hubbard model, which is realized in state-of-the art quantum gas microscopy experiments. From the trained networks, we uncover that the key distinguishing features are fourth-order spin-charge correlators, providing a means to compare experimental data to theoretical predictions. Our approach lends itself well to the construction of simple, end-to-end interpretable architectures and is applicable to arbitrary lattice data, thus paving the way for new physical insights from machine learning studies of experimental as well as numerical data.

Increasingly, image-like experimental data from quantum systems promises to offer greater insight into the physics of correlated quantum matter [1][2][3][4][5][6]. However, the traditional framework of condensed matter physics [7] lacks principled approaches for analyzing such image-like data and connecting the data with theoretical insights. Without such techniques, the connection between theory founded on simple fundamental principles and image-like data is at best a one-way street where theory can produce approximate images that only partially resemble the real data. Hence often the validity of a theory has been advocated for based on the degree of resemblance in select features of theory-based and real data.
There have been growing efforts to adopt data science tools that have proved effective at recognizing every-day objects for objective analysis of image-like data on quantum matter [3,8,9]. The key idea is to use the ability of neural networks to express and model functions to learn key features found in the image-like data in an objective manner. However, there are two central challenges to this approach. First, the "black box" nature of neural networks is particularly problematic when it comes to scientific applications, where it is critical that the outcome of the analysis is based on scientifically correct reasoning [10]. The second challenge unique to scientific application of supervised machine learning (ML) approaches is the shortage of real training data. Hence the community has generally relied on simulated data for training [3,9,11]. However, it has not been clear whether the neural networks trained on simulated data properly generalize to experimental data. The path to surmounting both of these issues is to obtain some form of interpretability in our models. To date, most efforts at interpretable machine learning on scientific data have relied on manual inspection and translation of learned features from training standard architectures [12][13][14]. Instead, here we propose an entirely new approach designed from the groundup to automatically learn information that is meaningful within the framework of physics.
The need for a principled data-centric approach is particularly great and urgent in the case of synthetic matter experiments such as quantum gas microscopy (QGM) [1], ion traps [15], and Rydberg atom arrays [16,17]. While our technique is generally applicable, in this work we focus on QGM, which enables researchers to directly sample from the many-body density matrix of strongly correlated quantum states that are simulated using ultra-cold atoms. With the quantum simulation of the fermionic Hubbard model finally reaching magnetism [2] and the strange metal regime [18,19], QGM is poised to capture a wealth of information on this famous model that bears many open questions and is closely linked to quantum materials. However, the real-space snapshots QGM measures are a fundamentally new form of data resulting from a direct projective measurement of a many-body density matrix as opposed to a thermal expectation value of ob-servables. While this means richer information is present in a full dataset, little is known about how to efficiently extract all the information. When it comes to the questions regarding the enigmatic underdoped region of the fermionic Hubbard model, the challenge is magnified by the fact that fundamentally different theories can predict QGM data with seemingly subtle differences within standard approaches [19,20].
In this letter, we develop Correlator Convolutional Neural Networks (CCNNs), a novel architecture with a set of nonlinearities designed to produce features that are directly interpretable in terms of correlation functions in image-like data (see Figure 1). Following training of this architecture, we employ regularization path analysis [21] to rigorously identify the features that are critical in the CCNN's performance. We apply this powerful combination of CCNNs and regularization path analysis to simulated and experimental QGM data of the under-doped Fermi Hubbard model. Following this, we discuss the new insights we gain regarding the hidden signatures of two theories, geometric string theory [22] and π-flux theory [23,24], as well as application to non-spin-resolved experimental data.
The Hubbard model of fermionic particles on a lattice is a famous model that bears many open questions and is closely linked to quantum materials such as hightemperature superconductors. The model Hamiltonian is given by where the first term describes the kinetic energy associated to electrons hopping between lattice sites, and the second term describes an on-site repulsion between electrons. At half-filling, and in the limit U t, the repulsive Hubbard model maps to the Heisenberg antiferromagnet [25]. However, the behavior of the model as the system is doped away from half-filling is not as wellunderstood. Several candidate theories exist which attempt to describe this regime, including geometric string theory [22] and π-flux theory [23,24]. These theories are conceptually very distinct, but at low dopings measurements in the occupation basis do not differ enough in simple conventional observables such as staggered magnetization or two-point correlation functions to fully explain previous ML success [9] in discrimination (see SM Sec. S.IV). Nevertheless, there are more subtle hidden structures involving more than two sites [20] which are noticeable. In the "frozen spin approximation" [26], geometric string theory predicts that the motion of the holes simply displaces spins backwards along the path the hole takes. Hence the propagation of the doped hole will tend to produce a "wake" of parallel line segments of aligned spins in its trail (see Fig. 2(a)). Meanwhile, the π-flux theory describes a spin liquid of singlet pairs, where it is more difficult to conceive of characteristic structures (see Fig. 2 Current QGM experiments are able to directly simulate the Fermi-Hubbard model, obtaining one or twodimensional occupation snapshots sampled from the thermal density matrix ρ ∼ e −βH prescribed by the model [2]. However, currently our experiment can only resolve a single spin species at a time, leaving all other sites appearing as empty. This is not a fundamental limitation of QGM experiments and in principle, complete spin and charge readout is possible [27,28]. As we aim to learn true spin correlations, in this work we use primarily simulated snapshots at doping δ = 0.09 sampled from the geometric string and π-flux theories using Monte Carlo sampling techniques under periodic boundary conditions (see SM Sec. S.I).
We point out that in the context of this paper, when referring to two models as different, we do not imply that they are fundamentally distinct, in the sense that they can not be connected smoothly without encountering a singularity in the partition function. Rather, this is a practical question: we have two or more mathematical procedures for generating many-body snapshots based on variational wavefunctions, Monte-Carlo sampling, or any other theoretical approach. Our goal is to develop a ML algorithm that separates snapshots based on which procedure they are more likely to come from and, most importantly, the algorithm should provide information about which correlation functions are most important for making these assignments.
To learn how to distinguish these two theories we propose a novel neural network architecture, Correlation Convolutional Neural Networks (CCNNs), schematically shown in Fig. 1. The input to the network is an image-like map with 3-channels Since the models we consider are restricted to the singlyoccupied Hilbert space, this input only takes on values 0 or 1. From this input, the CCNN constructs nonlinear "correlation maps" containing information of local spinhole correlations up to some order N across the snapshot. This operation is parameterized by a set of learnable 3channel filters, {f α,k |α=1, · · · , M } where M is the number of filters in the model. The maps for the given filter α are defined as: . . . . The input is a three-channel image: The image is first convolved with learned filters fα to produce a set of convolutional maps C (1) α ( x). Maps containing information about higher-order local correlations can then be recursively constructed using the lower-order maps, truncating at some order N . Spatially averaging these maps produces features c (n) α which in expectation are equal to weighted sums of correlators found within the corresponding convolutional filter. These features are normalized to zero mean and unit variance by a BatchNorm layer, then used by a logistic classifier with coefficients β (n) α to produce the final outputŷ. Here a runs over the convolutional window of the filter α. Traditional convolutional neural networks employ only the first of these operations, alternating with some nonlinear activation function such as tanh or ReLU(x) = max(0, x). The issue with these typical nonlinear functions is that they mix all orders of correlations into the output features, making it difficult to disentangle what exactly traditional networks measure. In contrast, each order of our nonlinear convolutions C (n) α ( x) are specifically designed to learn n-site semi-local correlations in the vicinity of the site x, which appear as patterns in the convolutional filters f α . Note that the sums in Eq. 2 exclude any self-correlations to aid interpretability. During training, a CCNN tunes the filters f α,k ( a) such that correlators characteristic of the labeled theory are amplified while others are suppressed. To aid interpretation, we force all filters to be positive f α,k ( a) ≥ 0 by taking the absolute value before use on each forward pass.
A direct computation of the nonlinear convolutions following Eq. 2 up to order N requires O((KP ) N ) operations per site, where P is the number of pixels in the window of the filter and K is the number of species of particles. However, we can use the following recursive formula which we prove in the Supplement, Section S.II: where all powers are done pixelwise [29], and we define C α ( x) = 1. This improves the computational complexity to O(N 2 KP ) while also allowing us to leverage existing highly-optimized GPU convolution implementations. Use of this formula leads to a "cascading" structure to our model similar to [30], as seen in Fig. 1. First, the input S is convolved with filters f α to produce the firstorder maps C (1) α . Using Eq. 3, these first order maps can be used to construct second order maps C (2) α , and onwards until the model is truncated at some order N . Since the Hamiltonians being studied are translationinvariant, we then obtain estimates of correlators from these correlation maps by simple spatial averages to produce c In the back portion of a CCNN (see Fig. 1), the feature vector c is normalized using a BatchNorm layer [31], then used by a logistic classifier which produces the classification outputŷ( c; β, α } and are trainable parameters. Ifŷ < 0.5, the snapshot is classified as π-flux, and otherwise it is classified as geometric string theory. The β where y = {0, 1} is the label of the snapshot, and γ is the L1 regularization strength. The role of the L1 loss is to bias the filter patterns to be simple by turning off pixels which are unnecessary (see SM Sec. S.I) [32].
We fix the number of filters M and the maximum order of the non-linear convolutions N , a hyper-parameter specific to CCNN, by systematically observing the training performance. We found that two filters gives sufficient performance while allowing for simple interpretation. Hence we consider two filters, i.e., M = 2 in the rest of the paper. For the maximum order of non-linear convolution N we found the performance to rapidly increase with increase in N up to N = 4, past which performance plateaus. (see SM Sec. S.I.) Hence we fix N = 4 in the rest of the paper. Additionally, we limit our investigation to 3×3 convolutional filters. With the architecture of the CCNN so-fixed we found the performance of this minimalistic model to be comparable with a more complex traditional CNN architecture [9] (see SM Sec. S.I).
After a CCNN is trained, we fix the convolutional filters f α and move on to a second phase to interpret what it has learned. We first determine which features are the most relevant to the model's performance by constructing and analyzing regularization paths [21] to examine the role of the logistic coefficients β (n) α . We apply an L1 regularization loss to these β tion: α . This results in an optimization trade-off between minimizing the classification loss and attempting to keep β (n) α at zero, where the relative importance of these terms is tuned by λ. At large λ, the loss is minimized by keeping all β (n) α at zero, resulting in a 50% classification accuracy due to the model always predicting a single class. As λ is slowly ramped down, eventually the "most important" coefficient β (n) α will begin to activate, due to the decrease in classification loss surpassing the increase in the activation loss. As these coefficient couple the correlator features c (n) α to the prediction output, this process offers clear insight into which features are the most relevant.
We show a typical regularization path analysis in Fig. 3, where the filters f α of a trained model are shown in the inset. The activation of each coefficient β are the two first coefficients to activate. Furthermore, parallel tracking of the accuracy in Fig. 3(b) shows that the activation of these features results in large jumps in the classification accuracy, comprising almost all of the network's predictive power. While the details of the paths vary between training runs, we find robust dominance of fourth-order correlations as the first features to be activated to give the majority of the network's performance. Now that we know the fourth-order correlations are the important features, we look at which physical correlators are being measured by the features c (4) α by simply inspecting 4-pixel patterns made from high-intensity pixels from each channel of the learned filters, as we show in Fig. 4. Comparing these patterns with the depiction of the two candidate theories, we can understand why these correlators measured by the two filters are indeed prominent motifs. Specifically, the 2 × 2 correlators in the fourth-order feature of the filter associated to the geometric string theory ( Fig. 4(a)) are easily recognizable in the "wake" and the termination of a string. These discovered correlations are in agreement with those examined in Ref. [28], which found pronounced spin anticorrelations induced on the spins located on the diagonal adjacent to a mobile chargon. Meanwhile, the 2 × 2 motifs in the filter learned to represent the π-flux theory ( Fig. 4(b)) are either a single spin-flip or a simple placement of a hole into an AFM background. It is evident that this CCNN is learning the fingerprint correlations of geometric string theory, recognizing the π-flux theory instead from fluctuations which are uncharacteristic of the string picture. Furthermore, a subset of learned patterns that are not obvious from the simple cartoons can be used as additional markers to detect the states born out of the two theoretical hypotheses in experiment (see SM Sec. S.IV for more detail).
It is important to note that the above insights relied on the fact that our CCNN's structure can be understood as measured collections of correlators. Although the regularization path analysis can be applied to any architecture, the typical non-linear structures of off-theshelf CNNs inhibit direct connections between the dominant filters and physically meaningful information [33]. In SM Sec S.VI we present how interpretation of the architecture of Ref. [9] can be attempted following similar steps as above. Since the fully connected layer contains tens of thousands of parameters, after training we show that we can reduce this layer to a simple spatial averaging to attempt interpretation, with no loss in performance (see SM Sec. S.V). The reduced architecture with a single "feature" per convolutional filter, similar to the architecture of Ref. [33], is trained, after which we fix the filters for the regularization path analysis. We can clearly determine which filters produce the important features, but it is unclear what these features are actually measuring due to the ReLU nonlinearity. However, without any nonlinearity the architecture only achieves close to 50% performance. This failure to enforce simplicity on traditional architectures shows the importance of designing an architecture which measures physically meaningful information from the outset.
The ML method presented in this paper considers short-range multi-point correlations functions (up to 3 lattice sites in both x and y directions), but does not include long-range two-point correlations needed for identifying spontaneous symmetry breaking. Two considerations motivate this choice: i) Current experiments with the Fermi-Hubbard model are done in the regime where correlations involving charge degrees of freedom are not expected to exceed a few lattice constants due to thermal fluctuations. ii) The energy of systems with local interactions, such as the Fermi-Hubbard model, is primarily determined by short-range correlations. We note, however, that the current method can be extended to include longer range correlations either by expanding the size of the filters used in Eq. 2, or by using dilated convolutions.
To summarize, we proposed a new neural network architecture that is inherently interpretable as measuring sets of multi-site correlators. We then applied this architecture to the supervised learning problem of distinguishing two theoretical hypotheses for the doped Hubbard model: π-flux theory and geometric string theory. Employing a regularization path analysis technique on these trained CCNN architectures, we found that foursite correlators deriving from the learned filters hold the key fingerprints of geometric string theory. A subset of these four-site motifs fit into what is expected from the wake of a propagating hole in an antiferromagnetic background. The remaining four-site motifs which go beyond our existing intuition offer new insight into the problem.
The broad implications of CCNN-based machine learn-ing for analysis and acquisition of image-like data are twofold. Firstly, the revelation of specific high-order correlations as defining features of target states can guide experimental design. Specifically for QGM, our CCNN approach found fourth-order correlators carry the key signature distinguishing geometric string theory and π-flux snapshots. Our discovered patterns are consistent with a recent report on the importance of a specific fifth-order correlator found with a targeted manual analysis of simulated data [34]. These observations indicate that future experiments should target measurement of higher-order correlations, as also recommended in [9,19]. Secondly, CCNN can reveal any critical gaps between simulated and experimental data for neural-network based hypothesis testing. In particular, we found that for the data we had access to, experimental uncertainties on the actual doping and temperature due to the lack of spinresolution allowed the CCNN to focus solely on the doping level rather than meaningful correlation functions (see SM Sec. S.VI). For successful hypothesis testing, spin-resolved QGM datasets, which are just becoming accessible [27,28], will be necessary. With growing access to spatially resolved image-like data in quantum matter, we anticipate our CCNN approach can bring the power of neural networks to the design and analysis of image-like data that organically fits into the tradition of physics.
Disclaimer. This report was prepared as an account of work sponsored by an agency of the United States Government. Neither the United States Government nor any agency thereof, nor any of their employees, makes any warranty, express or implied, or assumes any legal liability or responsibility for the accuracy, completeness, or usefulness of any information, apparatus, product, or process disclosed, or represents that its use would not infringe privately owned rights. Reference herein to any specific commercial product, process, or service by trade name, trademark, manufacturer, or otherwise does not necessarily constitute or imply its endorsement, recommendation, or favoring by the United States Government or any agency thereof. The views and opinions of authors expressed herein do not necessarily state or reflect those of the United States Government or any agency thereof.

S.I Training Details
Training data generation Equilibrium snapshots of both models -π-flux theory and geometric string theory -can be sampled straightforwardly, as discussed in [9] and [20]. We here briefly summarize the corresponding sampling techniques.
Geometric string theory only makes a statement about how the doped model deviates from half-filling. The spinbackground at half-filling here is given by sampling snapshots of the Heisenberg model at finite temperature using quantum Monte Carlo techniques. We sample snapshots for a 40 × 40 system with periodic boundary conditions and then cut out a 16 × 16 observation region from each snapshot. For a given doping level, we then insert the corresponding number of holes into each snapshot at random positions, where holes cannot sit on the same site. The geometric string theory provides a distribution of string lengths for a given temperature [20,26]. For each hole, we thus sample a string length from the string length distribution and move the hole by hand for a corresponding number of sites through the spin background in random directions while displacing the spins along its path.
Snapshots from π-flux theory are generated using standard Metropolis Monte Carlo sampling of the Gutzwiller projected thermal density matrix of the mean-field Hamiltonian [35] Here, i ∈ A(B) denotes lattice sites i which are part of the A(B) sublattice andĉ i,σ is the annihilation (creation) operator of a fermion with spin σ. The mean-field Hamiltonian describes a system with staggered flux ±Φ = ±4θ 0 and effective hopping amplitude J * . In particular, we consider π-flux states with θ 0 = π/4. We simultaneously sample the occupation in momentum and real space. The real and momentum space configurations are denoted as |α r and |α k , respectively. In momentum space, the two spin species are treated separately, such that two fermions of opposite spin can occupy the same momentum state. In real space, two fermions with opposite spin cannot occupy the same site, thus directly implementing the Gutzwiller projection. In any given real space configuration |α r , each site is therefore either empty or occupied with a spin up or a spin down fermion. The mean field Hamiltonian (S1) can be readily diagonalized in momentum space. For each momentum space configuration |α k , we thus directly obtain an energy E(α k ) and from that the corresponding thermal weight. We use the Metropolis Monte Carlo algorithm [36] to sample Gutzwiller projected real space snapshots |α r according to the probability distribution The overall energy scale J * of this model is treated as a free parameter which is fit so that the nearest-neighbor spin correlators match with those of geometric string theory at half-filling. We sample snapshots of size 16 × 16 with periodic boundary conditions. Our dataset consists of 10000 sampled snapshots from each theory, 9000 of which go into the training set which is seen by the network, with the other 1000 being reserved for validation. This makes our training set size 18000, and our validation set size 2000. The exact partitioning of snapshots into these sets is determined by a random seed we call the "split seed".
Training Procedure As described in the main text, training is done in two phases. The first is done in PyTorch [37], in which the full model, including both the convolutional filters and the logistic classifier, is trained using the ADAM optimization algorithm. The resulting model from this process can be used as-is for classification. However, for an interrogation of which features are most important, regularization paths such as those in Fig. 3 are produced in a second phase. Here, the convolutional filters are held fixed and only the back logistic classifier is trained multiple times over a wide range of L1 λ coefficients. This phase is done in Scikit-Learn [38] due to its extremely efficient logistic regression routines.
During the first phase, the optimization process attempts to minimize the loss: where the first term is the standard cross-entropy loss used for classification tasks, and the second term is an L1 (or LASSO) regularization which has the effect of driving the unimportant components of the convolutional weights f α to zero.
To allow for simple interpretation of the resulting filters, we strictly limit the f α to take on only positive values. This can either be done by replacing the f α in place with their absolute value after each gradient update, or by simply taking the absolute value every time a forward pass is done. This was found to incur a ∼ 2% accuracy loss, which we find acceptable in order for easier interpretation. In contrast, forcing the filter weights to be positive seems to entirely halt the learning process for traditional CNN architectures.
The loss during the second phase is similar: with the notable difference that the LASSO regularization is now being applied to the logistic weights rather than the convolutional filters. As shown in Fig. S.10, we utilize a BatchNorm [31] layer intermediate between the nonlinear convolutions and the logistic classifier, without the additional learnable affine transformation typically used as we find these introduce additional complexity without much benefit for our problem. During training, this layer simply normalizes the features produced from each minibatch to be zero mean and unit variance to allow for easier classification. During validation, the layer uses exponential running estimates of the mean and variance for normalization rather than the minibatch statistics. We empirically found this layere to be essential to creating a well-performing architecture, with the hypothesis that this is related to the different scales that each nonlinear feature tends to exist at. Normalization brings all of the features to the same relative scale, allowing the classifier to have an easier time detecting distributional shifts.
However, there does exist an unintended interaction between L1 regularization and BatchNorm. Due to the normalization process, the architecture is invariant to an overall scaling of the filter weights. Meanwhile, our intended goal of using L1 was to drive "unimportant" pixels to zero. In a sense, the network can do this for "free" since it can scale the filter weights without any loss in performance. In practice however, we find that the L1 loss still does bias the network towards having lower complexity filters. However the relationship between the λ parameter and the number of pixels activated is not always simple -sometimes increasing λ will result in more pixels activated. While the interaction between L2 regularization and BatchNorm is well understood [39], the authors of this work are unaware of any similar understanding with L1 regularization. A solution to this issue is still a desired feature.
Symmetrization One factor leading to the overparameterization of standard CNNs is that to reach peak accuracy, they need to explicitly learn multiple symmetry-equivalent versions of spin patterns. To achieve the same effect without requiring the duplication of filters, we use a D 8 symmetry-equivariant form of the convolutional operation as introduced in [40]. A visual explanation of the operation as performed in a standard CNN pipeline can be seen in Fig. S.1.
Modification to suit our architecture is simple, following the steps described in [40] to extend this idea to arbitrary models. Before any operation is applied, a "symmetric slicing" operation is done which stacks extra rotated/flipped copies of the input into the batch dimension. The rest of the operations in the architecture are applied as usual to the entire batch. Then, before feeding the final features into the logistic classifier, a "symmetric pooling" operation applies the correct inverse symmetry operations to each copy of the input, then averages across them. This entire block of operations then forms features which are equivariant to the desired symmetries of the input. If these features are then spatially averaged, they instead form invariants (in which case the aforementioned inverse symmetry operations are not strictly needed). For fair comparison, every model examined in this work had this symmetrization applied. Performance Measurements In Fig. 2(a), we show the performance of our architecture at various orders to which the model is constructed, compared against a traditional CNN architecture using ReLU as the nonlinearity. (For details, see the "reduced architecture" described in Sec S.V). We also have compared to the architecture of [9], adapted to accept three-channel snapshots as input, though it is difficult to control the overfitting even with strong regularization. Meanwhile, our architecture does not show signs of significant overfitting even in absence of regularization due to its small parameterization. Out of all of our trials, no tested CNN has outperformed our CCNN models on the validation dataset.
Each curve shown is labeled with the number of filters that model contains; we increase the number of filters as the order of the model decreases to keep the total number of features relatively constant, for a fair comparison. The solid lines show the running-max (over all previous epochs) of the median validation accuracy achieved between five independent training runs on the same train-val split of the data, but with different parameter initializations and batching order, while the shaded regions shown the min-max spread across these models. Note that, to avoid unfairly biasing the higher-order networks, the models shown here are not trained with the L1 regularization on the filter weights.
We see that the performance of the architecture rapidly increases as a function of the order to which the model is constructed, plateauing past fourth order. At this order, we consistently match performance with traditional CNN architectures, even with only two convolutional filters and dramatically fewer learnable parameters. We find performance plateauing past fourth order to be a general behavior, independent of the number of filters, regularization strength, etc. This indicates that fifth-order and higher correlations provide no new statistically significant information between the two datasets, at least at the snapshot sizes we use.
In Fig. 2(b), we show the final trained performance of our architecture as a function of the number of filters used. For these measurements, L1 regularization is turned off, as it may prevent additional filters from being used at all. Interestingly, we find that we can get close to optimal performance with just a single convolutional filter, with performance quickly plateauing past this.
We find that the parallel spin and L-shaped patterns shown in the main text are generally robust features learned by the architecture. The "interlocking-L" pattern is extremely robust, with some variant occurring in nearly every trained model. While the parallel spin pattern does not appear in every trained model, it does seem to be the second most common pattern. In Fig. S.3, we show examples of models trained on the same data with different random seeds controlling the initialization. We note that while the exact filter pattern varies between training runs, the dominant subpatterns tend to match between all of runs.
(a) Running-max performance of our architecture compared at different orders we construct the model. We compare against a traditional CNN architecture using ReLU as the nonlinear function. For simplicity, in this section we will use subscripts to represent spatial indexing, and forgo a channel index. I.e. S i represents the value of the input at site i. Since information from different filters is never mixed, it is sufficient to consider this operation with a single filter f a where a indexes the sites within the convolutional window. Furthermore, define x a ≡ f a S i+a . Our goal is then to show, with the definition that the following recursive formula holds: with the definition that C (0) = 1. This can be confirmed by direct substitution and unwrapping the sum term-by-term. The l = 1 term of the sum reads If we imagine expanding this as a sum of products of n variables, there will exist two types of terms: those where all x's in the term are unique, and those where x a from the first sum equals exactly one of the x aj . Each of the former is overcounted by a factor of n, and each of the latter by a factor of n − 1: x aj (S8) The rest of the terms in the l sum of Eq. S6 serve solely to cancel out the extraneous terms on the right. We can see that the l = 2 term splits into two pieces, similar as to how the l = 1 term did: x aj (S9) So, adding together the l = 1 and l = 2 terms cancels the second sum on the right hand side of Eq. S8 which contains terms involving x 2 a , but introduces another extraneous sum of terms involving x 3 a . In general, the l th term expands to (S10) From this expansion, we can see that the left piece of the l th summand cancels the right piece of the l−1 th summand. This expansion continues to unzip up until the l = n term, (−1) n−1 a x n a , in which the right piece is zero. Hence, once the sum is fully unzipped, the only term remaining is the left piece of the l = 1 summand, which is exactly C (n) . This result still holds for multi-channel images and filters. To restore a channel index, all of the above equations will remain true with the transformations where k = {1, 2, . . . K} runs over the number of channels K in the input snapshot.
Using Eq. S6, we can compute each of these C (n) in order, efficiently utilizing the results of previous computations to only require O(N 2 KP ) operations per site total. The coefficients in parentheses can be seen to be the result of taking the convolution of the l th power of the convolutional filter with the l th power of the occupancy snapshot, taken pixelwise. We can save on an extra bit of computation (though not changing the overall complexity) if the system is fermionic, in which case S l = S for arbitrary l ≥ 1.

S.III Understanding the Behavior of Convolutional Activation Maps under Nonlinearities
Here we will take a moment to understand how nonlinearities applied to convolutional activation maps can be understood as extracting local correlations similar to the learned filter pattern. Consider an input snapshot S x , containing some observable at each location x (we can also imagine x as a multi-index including a channel dimension, to allow for multiple observables at each location). A convolution with some filter f a indexed within a window by a then produces the map C x = f a S x+a , where the sum over a is implied.
Within this section, we will define the expectation of variables as: that is, the average value of the variable when S is sampled from the thermal density matrix of some Hamiltionian H. Before applying any nonlinearity, the features we measure on average are just: Hence, what we would measure is simply a weighted sum of the average values of the observables -in our case, the occupations. This would be true for any linear map applied to the data. If we instead apply an analytic nonlinear function σ to the activation map, Taylor expand about f = 0, and then take the expectation we instead measure: We can see that each order of the expansion is incorporating information about the same-order correlation functions involving sites contained within the convolutional window. Hence, we can say that CNNs using traditional nonlinearities measure spatial correlations in the data at all orders which fit into the convolutional window.
We demonstrate this on a simple example in Fig. S.4. Here, we have trained a simple CNN identical to that in Fig. 8(b), but using the tanh nonlinearity rather than ReLU, and only with a single filter. The model is trained to distinguish between snapshots sampled from the geometric string theory, and from the "sprinkled holes" theory described in [20], where holes are simply randomly placed into snapshots sampled from the same antiferromagnetic Heisenberg model as used for string theory snapshot generation. The first histogram in the figure shows the distribution of C x when the trained model is applied to snapshots from each theory. As both theories have identical occupation statistics the means of each distribution are essentially identical, implying that C x directly is not a useful feature for classification. However, we can notice that the variances ∼ C 2 x related to two-point correlations, are slightly different.
The next histogram shows the distribution of tanh(C x ), where we can now see that the means of each distribution have separated very slightly. An alternate way to understand Eq. S14 is the statement that the mean of σ(C x ) is related to all higher moments of the C x distribution. The separation in means of tanh(C x ) is small on the level of individual pixels, but we can see that after a summation, the final distributions of x tanh(C x ) results are well separated allowing for simple classification. While we may not be able to perform an expansion for nonanalytic functions such as ReLU, it is this property of mixing moments of the C x distribution that allows nonlinearities to introduce higher-order correlations into the learned features.
The central idea behind CCNNs derives from these observations: we can directly use powers of the convolutional activation map to extract out correlations of different orders. Our nonlinear operations in Eq. 2 are simple polynomials of the inputs, designed such that self-correlations are removed for simpler direct interpretation.

S.IV Exact Measurements
To confirm that our ML models are indeed finding true physical features, we have explicitly calculated correlator estimates for both of the training datasets.
We first measure some examples of "simple" observables at the δ = 0.09 doping level studied in this work. We define s ij = +1, −1, 0 if a spin up, spin down, or hole, respectively, lives at site (i, j). We measure the staggered magnetization, and the sign-corrected nearest neighbor spin-spin correlator, , with results shown in Fig. S.5. We see that the staggered magnetization is nearly indistinguishable, though the nearest-neighbor spin correlator does show some minor deviation between the theories. However, this discrepancy is hardly enough to explain the > 80% classification accuracy of ours and previous [9] ML models. Indeed, our 2 nd -order CCNN can pick up this correlation, and only manages to achieve ≈ 63% classification accuracy. Further measurements find that all furtherrange two-point spin-spin correlators are nearly indistinguishable between these two theories.
We now turn to the fourth-order correlations discovered by the CCNN. In Fig. S.6, we show histograms of correlator estimates obtained from single snapshots contained within the two datasets. Due to the D 8 symmetry of the models, we average over all symmetry-equivalent versions of each correlator for each estimate as the symmetrization of our ML models would. From the figure, we can see that the patterns which are the dominant subpatterns of the learned CCNN filters are indeed biased towards the theory in alignment with what the model predicts, with many distributions being more clearly separated than the two-point NN correlator distributions from Fig. 5(b). We can also see from the figure that some subpatterns contained in the filters actually show no significant difference between the two theories; our interpretation of this is that these patterns emerge as "connections" when the CCNN attempts to include multiple significant patterns within a single filter. Since these connecting patters are statistically identical between the two theories, including them is a "free" action to the network which will not hamper performance.
In Fig. S.7, we plot measured fourth-order correlators obtained from the two datasets as a function of hole doping. While all models in this work are trained on data at 9% doping, this plot shows an interesting trend. We note that at 0% doping, the "parallel spin" correlator (red) is nearly identical between the two theories. It is only once a finite hole doping is introduced that these correlators begin to deviate from each other. This agrees with our explanation of strings leaving a "wake" of parallel spins, increasing this four-site correlator relative to the π-flux theory.
FIG. S.6. Explicit statistical measurements of the fourth-order correlators discovered in Fig. 4. Histogrammed are normalized counts of each pattern (and its symmetry equivalents) obtained from single snapshots of each theory, with each histogram scaled to integrate to one. Vertical lines denote the mean of each distribution.

S.V Regularization Paths of Traditional CNNs
The regularization path technique demonstrated in the main text can also be applied to shallow traditional CNNs as an alternative to more complex techniques such as Layerwise Relevancy Propagation [41] which are appropriate for deeper models. In these scenarios, regularization paths can still determine which filters are contributing most to the classification, however information about which orders of correlations are important is inaccessible. We present here an example, and additionally show that fully-connected layers tend to simply "memorize" quantum gas data rather than learn true physics by "reducing" a full CNN architecture to a simpler one with minimal modifications As a demonstration, consider a CNN with an extremely similar architecture as our CCNN, but using a standard nonlinear step, as seen in Fig. 8(b). The input S is convolved with a set of learned filters f α to produce activation maps C α . Then, a nonlinear function σ is applied pixelwise to the activation map to produceC α = σ(C α ); here we choose σ = ReLU. These are spatially averaged to produce c α features, which are used by a final logistic classifier with coefficients β α . Heuristically, each c α captures how much the patterns seen in the snapshot S "look like" the filter patterns f α . Note that a BatchNorm layer is not strictly needed here due to the relatively uniform scale of the features c α ; it may help training progress easier, but is not required for the model to work. Hence, we will omit it for this test. Additionally, we have found that the model becomes extremely difficult to train if the filter parameters f α are forced positive as we did for CCNNs. As a consequence, we were unable to apply this constraint, resulting in filters which are harder to interpret than for CCNNs.
To perform a "reduction" to this model from the architecture of [9], we first train their architecture exactly as we did for CCNNs: the full model is first trained with an L1 regularization on the filters f α . After training, the filters are frozen, the fully connected layer is replaced with simple spatial averaging, and the β α coefficients are regressed at various strengths of regularization applied to them. The result of doing this process can be seen in S.8. While not shown, we observed that the replacement of the fully-connected layer with simple averaging actually improves validation performance: this provides empirical evidence that this overparameterized layer is simply "memorizing" the input data rather than learning true physical features.
We can attempt to interpret the path in Fig. S.8 by examining the patterns of the filters which activate first. The first feature to activate, labeled in purple, seems to have something to do with local antiferromagnetic correlations. The second feature, in grey, as well as the blue feature, seem to have something to do with spin-hole correlations. However, it is not as clear how we should understand the red or brown features which activate. Additionally, for each of these patterns we are unable to tell what order feature from the pattern is actually used. The additional complexity of allowing for negative values in filters, along with being unable to disentangle different orders of correlations, make direct interpretation of traditional CNNs generally difficult. As a demonstration of a case where CCNNs can be helpful in detecting "trivial" features of the data, we will train a CCNN to distinguish between real experimental data and simulated π-flux snapshots, both at 9% doping. We tune the temperature of the π-flux simulation as to best match the nearest-neighbor spin correlator with experimental data, as done in [9]. The snapshots are zero-masked to the same geometry of the experiment. Because our QGM experiments currently cannot resolve both spin species simultaneously, we convert all spin-down sites in the π-flux snapshots to appear as empty sites. Additionally, double-hole pairs also appear as empty sites in the experiment due to a parity projection; to accomodate this we insert doublon-hole pairs randomly on neighboring sites in the lattice with a probability matching the theoretical predictions, also appearing as empty sites. As we have two species of site (spin up and "empty"), this results in a set of two-channel snapshots input to the network.
We only have access to a small amount of experimental data, meanwhile π-flux data is essentially limitless. However, if we naively train a ML model with a dataset that contains significantly more data of one class than the other, a significant local minima the network can be trapped in is to simply predict the more populous class all the time. For example, in our dataset we have 2476 experimental snapshots, and 19500 π-flux snapshots. A model which predicts π-flux all the time would achieve an 89% accuracy! A simple solution to this is to oversample the experimental snapshots. On each epoch, we randomly duplicate snapshots from the experimental dataset until there are 19500 in the dataset.
As in the main text, we train a CCNN with two filters, and with an L1 loss applied to the filter weights. We use a slightly stronger value of λ = 0.01, and this was found to not significantly reduce final accuracy compared to the no-regularization case. (Note that the accuracy is much lower here mainly due to the smaller spatial extent of the snapshot: the circular region accessed by experiment contains only 90 sites while the snapshots in the main text contain 15 × 15 = 225). The final filters learned are shown in Fig. 9(a). We can see that the L1 loss has completely turned off one of the filters, while the other still has a collection of pixels left on.
We might originally guess that this is looking for specific patterns in the snapshots that somewhat resemble the pattern of the filter, i.e. correlators which are subpatterns. However, once constructing the regularization path from the model, shown in Fig. 9(b), we can see this is not the case! The 1st order correlator explains essentially all of the network's performance, meaning the network is really just measuring the "empty site" occupancy. (In fact, we can see in this instance the higher-order correlations contribute to overfitting, as the validation precision drops when they activate). A simple explanation of this is that the doublon-hole density in the experiment must be higher than expected for the given temperature. Positive coefficients correspond to π-flux snapshots, negative correspond to experiment.
FIG. S.9. Model behavior when classifying experimental v. π-flux data. Here, due to the class imbalance, we plot average precision rather than accuracy in the regularization path, which can be thought of as the average of the two accuracies on the classes individually.

S.VII Further Explanation of the CCNN operation
In Fig. S.10, we explicitly write out the terms of Eq. 2 for an example filter. We can see that for the feature C (n) , to find all the terms one draws all patterns which can be made by choosing n pixels from across the channels. As the Hilbert space of each of our models is restricted to the singly-occupied subspace, we do not need to consider patterns with more than one pixel at the same site. Each of these terms is weighted by the product of the intensities of the pixels which constitute the pattern.