Introduction

Deep links between physics and theories of computation1,2 are increasingly being exploited to uncover new fundamental physics and to provide novel insights into theories of computation. For example, advances in understanding quantum entanglement are often expressed in sophisticated information-theoretic language, while providing new results in computational complexity theory, such as polynomial-time algorithms for integer factorization3. These connections are typically expressed in terms of Shannon information, with its natural analogy to thermodynamic entropy.

There is, however, another branch of information theory, called algorithmic information theory (AIT)4, which is concerned with the information content of individual objects. It has been applied much less in physics (although notable exceptions exist; see5 for a recent overview). One reason for this relative lack of attention is that AIT’s central concept, the Kolmogorov complexity KU(x) of a string x, defined as the length of the shortest program that generates x on an optimal reference universal Turing machine (UTM) U, is formally uncomputable due to its link to the famous halting problem for UTMs6 — see7 for technical details. Moreover, many important results, such as the invariance theorem, which states that for two UTMs U and W the Kolmogorov complexities agree up to a constant, \({K}_{U}(x)={K}_{W}(x)+{\mathcal{O}}(1)\), hold only asymptotically, up to \({\mathcal{O}}(1)\) terms that are independent of x but not always well understood, and therefore hard to control.

Another reason why applications of AIT to many practical problems have been hindered can be understood in terms of hierarchies of computing power. For example, one of the oldest such categorisations, the Chomsky hierarchy8, ranks automata into four different classes, of which the UTMs are the most powerful, and finite state machines (FSMs) are the least. Many key results in AIT are derived by exploiting the power of UTMs. Interestingly, if physical processes can be mapped onto UTMs, then certain properties can be shown to be uncomputable9,10. However, many problems in physics are fully computable, and therefore lower on the Chomsky hierarchy than UTMs. For example, finite Markov processes are equivalent to FSMs, and RNA secondary structure (SS) folding algorithms can be recast as context-free grammars, the second level in the hierarchy. Thus, an important cluster of questions for applications of AIT revolves around extending its methods to processes lower in computational power than UTMs.

To explore ways of moving beyond these limitations and towards practical applications, we consider here one of the most iconic results of AIT, namely Levin’s coding theorem11 (see also the work of Solomonoff12, who introduced an earlier version of the a priori distribution), which predicts that, when programs are chosen at random (e.g. a binary program constructed by coin flips), the probability PU(x) that a UTM generates output x is bounded as \({2}^{-K(x)}\le {P}_{U}(x)\le {2}^{-K(x)+{\mathcal{O}}(1)}\). Technically, because we require the programs of the Turing machine to be prefix-free, we restrict ourselves to what are called prefix Turing machines. Given this profound prediction of a general exponential bias towards simplicity (low Kolmogorov complexity), one might have expected widespread study and applications in science and engineering. This has not been the case because the theorem unfortunately suffers from the general issues of AIT described above.

Nevertheless, it has recently been shown13,14 that a related exponential bias towards low complexity outputs obtains for a range of non-universal input-output maps f: I → O that are lower on the Chomsky hierarchy than UTMs.

In particular, an upper bound on the probability P(x) that an output obtains upon uniform random sampling of inputs,

$$P(x)\le {2}^{-a\widetilde{K}(x)-b}\ $$
(1)

was recently derived13 using a computable approximation \(\widetilde{K}(x)\) to the Kolmogorov complexity of x, typically calculated using lossless compression techniques. Here a and b are constants that are independent of x and which can often be determined from some basic information about the map. The so-called simplicity bias bound (1) holds for computable maps f where the number of inputs NI is much greater than the number of outputs NO and the map f is simple, meaning that asymptotically \(K(f)+K(n)\ll K(x)+{\mathcal{O}}(1)\) for a typical output x, where n specifies the size of NI, e.g. NI = 2^n. It is important to distinguish between the map f (which we assume is simple) and the initial conditions15, i.e. a program for x (which may or may not be simple).

Equation (1) typically works better for larger NI and NO. Because \(\widetilde{K}(x)\) only approximates the true Kolmogorov complexity, the bound should not be expected to work for maps where a significant fraction of outputs have complexities that are not qualitatively captured by compression-based approximations. For example, many pseudo-random number generators are designed to produce outputs that appear to be complex when measured by compression or other types of Kolmogorov complexity approximations. Yet these outputs must have low K(x) because they are generated by relatively simple algorithms with short descriptions. Nevertheless, it has been shown that the bound (1) works remarkably well for a wide class of input-output maps, ranging from the sequence to RNA secondary structure map, to systems of coupled ordinary differential equations, to a stochastic financial trading model, to the parameter-function map for several classes of deep neural networks13,16,17.
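To make this concrete, the following minimal sketch (not part of the original analysis) enumerates a toy map and compares output probabilities with a crude complexity proxy; both the toy map and the LZ78 phrase count below are illustrative stand-ins for the maps and the LZ-based estimator \(\widetilde{K}(x)\) of Eq. (11) in the Methods.

```python
# Minimal sketch (not part of the original analysis): enumerate all inputs of a
# toy map and compare output probabilities with a crude complexity proxy. The
# map and the LZ78 phrase count are illustrative stand-ins only.
from collections import Counter

def walk_sign_map(p):
    """Toy map: binary input -> sign pattern of the partial sums of +/-1 steps."""
    total, out = 0, []
    for bit in p:
        total += 1 if bit == '1' else -1
        out.append('1' if total > 0 else '0')
    return ''.join(out)

def lz78_phrases(s):
    """Crude complexity proxy: number of LZ78 phrases (not Eq. (11))."""
    phrases, cur = set(), ''
    for ch in s:
        cur += ch
        if cur not in phrases:
            phrases.add(cur)
            cur = ''
    return len(phrases) + (1 if cur else 0)

n = 20
counts = Counter(walk_sign_map(format(i, f'0{n}b')) for i in range(2 ** n))
for x, c in counts.most_common(5):
    print(f"P(x) = {c / 2 ** n:.4f}   phrases = {lz78_phrases(x)}   x = {x}")
# The highest-probability outputs (e.g. all 0s or all 1s) are also the least
# complex under this proxy, consistent with the exponential bias of Eq. (1).
```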

The simplicity bias bound (1) predicts that high P(x) outputs will be simple, and that complex outputs will have a low P(x). But, in sharp contrast to the full AIT coding theorem, it does not provide a lower bound, so outputs with low \(\widetilde{K}(x)\) but low P(x), far from the bound, are allowed. Indeed, this behaviour is generically observed for many (non-universal) maps13,16 (see also Fig. 1), but should not be the case for UTMs that obey the full coding theorem. Understanding the behaviour of outputs far from the bound should shed light on fundamental differences between UTMs and maps with less computational power that are lower on the Chomsky hierarchy, and may open up avenues for wider applications of AIT in physics.

Figure 1

The probability P(x) that a particular output arises upon random sampling of inputs versus output complexity \(\widetilde{K}(x)\) shows clear simplicity bias for: (a) a length L = 15 RNA sequence to SS mapping, (b) an FST, sampled over all 2^30 binary inputs of length 30, and (c) a 7-input perceptron with weights discretised to 3 bits. The black solid line is the simplicity bias bound (1) (with a and b fit). For all these maps, high complexity outputs occur with low probability. The outputs are colour coded by the maximum complexity \({K}_{\max }(p| n)\) of the set of inputs mapping to output x. Outputs further from the bound have lower input complexities. (d) Length L = 15 RNA, (e) the FST and (f) the perceptron, showing the data plotted against the lower bound (8) (black line) with only the intercept fit to the data; the slope is a prediction. The orange line uses Eq. (8) with a normalised probability as a parameter-free predictor. Including the complexity of the input through \({K}_{\max }(p| n)\) reduces the spread in the data, and so provides more predictive power than K(x) alone.

Results

With this challenge in mind, we take an approach that contrasts with the traditional coding theorem of AIT and with the simplicity bias bound, which only consider the complexity of the outputs. Instead, we derive bounds that also take into account the complexity of the inputs that generate a particular output x. While this approach is not possible for UTMs, since the halting problem means one cannot enumerate all inputs4, and so averages over input complexity cannot be calculated, it can be achieved for non-UTM maps. Among our main results, we show that the further outputs are from the simplicity bias bound (1), the lower the complexity of the set of inputs. Since, by simple counting arguments, most strings are complex4, the cumulative probability of outputs far from the bound is limited. We also show that by combining the complexities of the output with that of the inputs, we can obtain better bounds on and estimates of P(x).

Whether such bounds nevertheless have real predictive power needs to be tested empirically. Because input-based bounds typically need exhaustive sampling, full testing is only possible for smaller systems, which restricts us here to maps where finite size effects may still play a role13. We test our bounds on three systems: the famous RNA sequence to secondary structure map (which falls into the context-free class in the Chomsky hierarchy), here at a relatively small size, with length L = 15 sequences; a finite state transducer (FST), a very simple input-output map that is lowest on the Chomsky hierarchy8, with length L = 30 binary inputs; and finally the parameter-function map16,17 of a perceptron18 with discretised weights, so that the complexities of inputs can be calculated. The perceptron plays a key role in deep learning neural network architectures19. Nevertheless, as can be seen in Fig. 1(a–c), all three maps exhibit the simplicity bias predicted by Eq. (1), even though they are relatively small. In ref. 13, much cleaner simplicity bias behaviour can be observed for larger RNA maps, but these are too big for the inputs to be exhaustively sampled. Similarly, cleaner simplicity bias behaviour occurs for the undiscretised perceptron17, but then it is hard to analyse the complexity of the inputs. Figure 1(a–c) shows that the complexity of the input strings that generate each output x decreases with distance from the simplicity bias bound. This is the kind of phenomenon that we will attempt to explain.

To study input-based bounds, consider a map f: I → O between NI inputs and NO outputs that satisfies the requirements for simplicity bias13. Let f(p) = x, where p ∈ I is some input program producing output x ∈ O. For simplicity, let p ∈ {0, 1}^n, so that all inputs have length n and NI = 2^n (this restriction can be relaxed later). Define f−1(x) to be the set of all inputs that map to x, so that the probability that x obtains upon sampling inputs uniformly at random is

$$P(x)=\frac{| \,{f}^{-1}(x)| }{{2}^{n}}\ $$
(2)

Any arbitrary input p can be described using the following \({\mathcal{O}}(1)\) procedure13: assuming f and n are given, first enumerate all 2^n inputs and map them to outputs using f. The index of a specific input p within the set f−1(x) can then be described using at most \({{\rm{\log }}}_{2}(| \,{f}^{-1}(x)| )\) bits. In other words, this procedure identifies each input by first specifying the output x it maps to, and then its label within the set f−1(x). Given f and n, an output x = f(p) can be described using \(K(x| \,f,n)+{\mathcal{O}}(1)\) bits13. Thus, the Kolmogorov complexity of p, given f and n, can be bounded as:

$$K(p| \,f,n)\le K(x| \,f,n)+{{\rm{\log }}}_{2}(| \,{f}^{-1}(x)| )+{\mathcal{O}}(1).$$
(3)

We note that this bound holds in principle for all p, but that it is tightest for the most complex input, \({K}_{\max }(p|\,f,n)\equiv {\max }_{p\in {f}^{-1}(x)}K(p| \,f,n)\). More generally, we can expect these bounds to be fairly tight for the maximum complexity \({K}_{\max }(p| \,f,n)\) of the inputs, due to the following argument. First note that

$${K}_{\max }(p| \,f,n)\ge {{\rm{\log }}}_{2}(| \,{f}^{-1}(x)| )+{\mathcal{O}}(1)$$
(4)

because any set of | f−1(x)| distinct elements must contain at least one string of this complexity (there are fewer than \({2}^{k}\) strings with complexity less than k). Next,

$$K(x| \,f,n)\le K(p| \,f,n)+{\mathcal{O}}(1)$$
(5)

because each p can be used to generate x. Therefore:

$$\max (K(x| \,f,n),{{\rm{\log }}}_{2}(| \,{f}^{-1}(x)| ))\le {K}_{\max }(p| \,f,n)+{\mathcal{O}}(1),$$
(6)

so the bound (3) cannot be too weak. In the worst-case scenario, where \({K}_{\max }(p| \,f,n)\approx {{\rm{\log }}}_{2}(| \,{f}^{-1}(x)| )\approx K(x| \,f,n)\), the right hand side of the bound (3) is approximately twice the left hand side (up to additive \({\mathcal{O}}(1)\) terms). It is tighter if either K(x| f, n) is small, or if K(x| f, n) is large relative to \({{\rm{\log }}}_{2}(| \,{f}^{-1}(x)| )\). As is often the case for AIT predictions, the stronger the constraint/prediction, the more likely it is to be observed in practice, because, for example, the \({\mathcal{O}}(1)\) terms are less likely to drown out the effects.
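The two-part description behind Eq. (3) can be made concrete with a small sketch (an illustration only; the toy map below is hypothetical): any input p is recoverable from a description of its output x plus its index within f−1(x), which costs at most \({{\rm{\log }}}_{2}(| \,{f}^{-1}(x)| )\) extra bits.

```python
# Minimal sketch (not the authors' code) of the two-part description behind
# Eq. (3): given f and n, any input p can be recovered from (i) its output x
# and (ii) its index within f^{-1}(x). The toy map below is hypothetical.
from math import log2, ceil

def toy_f(p):
    """Hypothetical simple map: binary string -> number of 1s, written in binary."""
    return format(p.count('1'), 'b')

def preimages(f, n):
    """Group all length-n binary inputs by their output under f."""
    groups = {}
    for i in range(2 ** n):
        p = format(i, f'0{n}b')
        groups.setdefault(f(p), []).append(p)
    return groups

n = 12
groups = preimages(toy_f, n)
p = '010111000110'                               # an arbitrary input
x, idx = toy_f(p), groups[toy_f(p)].index(p)     # two-part description of p
assert groups[x][idx] == p                       # p is recoverable from (x, idx)
print(f"|f^-1(x)| = {len(groups[x])}, index needs "
      f"<= {ceil(log2(len(groups[x])))} bits, as in Eq. (3)")
```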

By combining with Eq. (2), the bound (3) can be rewritten in two complementary ways. Firstly, a lower bound on P(x) can be derived of the form:

$$P(x)\ge {2}^{-K(x| f,n)-[n-K(p| f,n)]+{\mathcal{O}}(1)}$$
(7)

for all p ∈ f−1(x), which complements the simplicity bias upper bound (1). This bound is tightest when evaluated at the maximum input complexity \({K}_{\max }(p| n)\).

In ref. 13 it was shown that \(P(x)\le {2}^{-K(x| f,n)+{\mathcal{O}}(1)}\) by using a counting argument similar to that used above, together with a Shannon-Fano-Elias coding procedure. Similar results can be found in standard works4,20. A key step is to move from the conditional complexity to one that is independent of the map and of n. If f is simple, then the explicit dependence on n and f can be removed by noting that \(K(x)\le K(x| \,f,n)+K(\,f)+K(n)+{\mathcal{O}}(1)\) and \(K(x| \,f,n)\le K(x)+{\mathcal{O}}(1)\), so that \(K(x| \,f,n)\approx K(x)+{\mathcal{O}}(1)\). In Eq. (1) this is further approximated as \(K(x| \,f,n)+{\mathcal{O}}(1)\approx a\widetilde{K}(x)+b\), leading to a practically usable upper bound. The same argument can be used to remove the explicit dependence on n and f from K(p| f, n).

If we define a maximum randomness deficit \({\delta }_{\max }(x)=n-{K}_{\max }(p| n)\), then this tightest version of bound (7) can be written in a simpler form as

$$P(x)\ge {2}^{-a\widetilde{K}(x)-b-{\delta }_{\max }(x)+{\mathcal{O}}(1)}$$
(8)

In Fig. 1(d–f) we plot this lower bound for all three maps studied. Throughout the paper, we use a scaled complexity measure, which ensures that \(\widetilde{K}(x)\) ranges between ≈0 and ≈n bits, for strings of length n, as expected for Kolmogorov complexity. See Methods for more details.

When comparing the data in Fig. 1(d–f) to Fig. 1(a–c), it is clear that including the input complexities reduces the spread in the data for RNA and the FST, although for the perceptron model the difference is less pronounced. This success suggests using the bound (8) as a predictor \(P(x)\approx {2}^{-K(x| f,n)-{\delta }_{\max }(x)}\), with the additional constraint that \({\sum }_{x}P(x)=1\) to normalise it. As can be seen in Fig. 1(d–f), this simple procedure works reasonably well, showing that the input complexity provides additional predictive power to estimate P(x) from some very generic properties of the inputs and outputs.
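The normalisation step can be sketched as follows (a minimal illustration, not the authors' code); here the dictionary k_tilde stands for the fitted complexity term \(a\widetilde{K}(x)+b\) and delta_max for \({\delta }_{\max }(x)\), with hypothetical values.

```python
# Minimal sketch (not the authors' procedure in detail): turn the lower bound
# (8) into a normalised predictor P_pred(x) proportional to 2^{-k_tilde(x) - delta_max(x)}.
import numpy as np

def predict_probabilities(k_tilde, delta_max):
    """Return P_pred(x), normalised over all outputs."""
    xs = list(k_tilde)
    logw = np.array([-k_tilde[x] - delta_max[x] for x in xs])
    w = 2.0 ** (logw - logw.max())        # subtract the max for numerical stability
    w /= w.sum()
    return dict(zip(xs, w))

# Hypothetical example: three outputs with complexities and randomness deficits
k_tilde   = {'x1': 5.0, 'x2': 12.0, 'x3': 12.0}
delta_max = {'x1': 0.0, 'x2': 0.0,  'x3': 6.0}   # x3 needs highly non-random inputs
print(predict_probabilities(k_tilde, delta_max))
# x3 is predicted to be ~2^6 times less likely than x2, despite equal complexity.
```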

A second, complementary way that bound (3) can be expressed is in terms of how far P(x) differs from the simplicity bias bound (1):

$$[{{\rm{\log }}}_{2}({P}_{0}(x))-{{\rm{\log }}}_{2}(P(x))]\le [n-K(p| \,f,n)]+{\mathcal{O}}(1)$$
(9)

where \({P}_{0}(x)={2}^{-K(x| f,n)}\approx {2}^{-a\widetilde{K}(x)-b}\) is the upper bound (1) shown in Fig. 1(a–c).

For a random input p, with high probability we expect \(K(p| \,f,n)=n+{\mathcal{O}}(1)\)4. Thus, Eqs. (7) and (9) immediately imply that large deviations from the simplicity bias bound (1) are only possible with highly non-random inputs with a large randomness deficit \({\delta }_{\max }(x)\).

In Fig. 2(a–c) we directly examine bound (9), showing explicitly the prediction that a drop in probability P(x) of Δ bits from the simplicity bias bound (1) requires a randomness deficit of at least Δ bits in the set of inputs.

Figure 2

Deviation of P(x) from the simplicity bias upper bound (1) increases with increasing randomness deficit \({\delta }_{\max }(x)=n-{K}_{\max }(p| n)\) for (a) L = 15 RNA, (b) L = 30 FST, and (c) the perceptron with weights discretised to 4 bits. For the perceptron, all functions with the same P(x) and K(x) are averaged together to reduce scatter. Points are colour coded by output complexity K(x). For the upper bound (9) (black line) we fit the intercept, but the slope is a prediction; if we treat it as a normalised probability we obtain the orange line, which is a direct prediction with no free parameters.

Simple counting arguments can be used to show that the number of non-random inputs is a small fraction of the total number of inputs21. For example, for binary strings of length n, with NI = 2^n, the number of inputs with complexity K = n − δ is approximately \({2}^{-\delta }{N}_{I}\). If we define a set \({\mathcal{D}}(f)\) of all outputs xi that satisfy \({{\rm{\log }}}_{2}({P}_{0}({x}_{i}))-{{\rm{\log }}}_{2}(P({x}_{i}))\ge \Delta \), i.e. the set of all outputs for which \({{\rm{\log }}}_{2}P(x)\) is at least Δ bits below the simplicity bias bound (1), then this counting argument leads to the following cumulative bound:

$$\sum _{x\in {\mathcal{D}}(f)}P(x)\le {2}^{-\Delta +1+{\mathcal{O}}(1)}$$
(10)

which predicts that, upon randomly sampling inputs, most of the probability weight falls on outputs with P(x) relatively close to the upper bound. There may be many outputs that are far from the bound, but their cumulative probability drops off exponentially the further they are from the bound, because the number of simple inputs is exponentially limited. Note that this argument is for a cumulative probability over all inputs. It does not predict that, for a given complexity K(x), the outputs should all be near the bound. In that sense this lower bound is not like that of the original coding theorem, which holds for any output x. See the Supplementary Information for an alternative derivation of this bound.
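In outline, the counting argument behind Eq. (10) runs as follows. By bound (9), every output \(x\in {\mathcal{D}}(f)\) can only be produced by inputs with \(K(p| \,f,n)\le n-\Delta +{\mathcal{O}}(1)\), and there are fewer than \({2}^{n-\Delta +1}\) binary strings of complexity at most \(n-\Delta \). Since the sets f−1(x) for distinct outputs are disjoint,

$$\sum _{x\in {\mathcal{D}}(f)}P(x)=\frac{1}{{2}^{n}}\sum _{x\in {\mathcal{D}}(f)}| \,{f}^{-1}(x)| \le \frac{{2}^{n-\Delta +1+{\mathcal{O}}(1)}}{{2}^{n}}={2}^{-\Delta +1+{\mathcal{O}}(1)}.$$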

Bound (10) does not need an exhaustive enumeration of inputs to be tested. In Fig. 3 we show this bound for a series of different maps, including many maps from ref. 13. The cumulative probability weight scales roughly as expected, implying that most of the probability weight lies relatively close to the bound (at least on a log scale).
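In practice, testing Eq. (10) only requires, for each sampled output, its probability and its distance below the fitted bound (1). A minimal sketch (not the authors' code; the numbers below are hypothetical placeholders):

```python
# Minimal sketch: cumulative probability of outputs lying at least delta bits
# below the fitted bound (1), for comparison with 2^{-delta+1} of Eq. (10).
import numpy as np

def cumulative_below_bound(P, log2_P0, deltas):
    """P: output probabilities; log2_P0: log2 of bound (1) for each output."""
    drop = log2_P0 - np.log2(P)          # distance below the bound, in bits
    return {d: float(P[drop >= d].sum()) for d in deltas}

P = np.array([0.5, 0.25, 0.125, 0.0625, 0.0625])       # hypothetical P(x) values
log2_P0 = np.array([-1.0, -2.0, -1.0, -2.0, -1.0])     # hypothetical fitted bound
print(cumulative_below_bound(P, log2_P0, deltas=[1, 2, 3]))
# For real data, log2_P0 = -a*K_tilde - b with a, b fitted as in Fig. 1(a-c).
```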

Figure 3

The cumulative probability versus the distance Δ from the bound correlates with the cumulative bound (10) (red line) for (a) L = 15 RNA, (b) L = 30 FST, (c) the perceptron, (d) a fully connected 2-layer neural network from16, (e) a coarse-grained ordinary differential equation map from13, which describes a circadian rhythm model from40, (f) an Ornstein-Uhlenbeck financial model from13, (g) L-systems from13, and (h) a simple matrix map from13. The solid red line is the prediction \({2}^{-\Delta +1}\) from Eq. (10); the dashed line denotes 10% cumulative probability.

What is the physical nature of these low complexity, low probability outputs that occur far from the bound? They must arise in one way or another from the lower computational power of these maps, since they do not occur for UTMs obeying the full AIT coding theorem. Low complexity, low probability outputs correspond to output patterns which are simple, but which the given computable map is not good at generating.

In RNA it is easy to construct outputs which are simple but have low probability. Compare two L = 15 structures, S1 = ((.(.(...).).)). and S2 = .((.((...)).))., which are both symmetric and thus have a relatively low complexity, K(S1) = K(S2) = 21.4. Nevertheless they have a significant difference in probability, P(S2)∕P(S1) ≈ 560, because S1 has several single bonds, which are much harder to make according to the biophysics of RNA. Only specially ordered input sequences can make S1; in other words, its inputs are simple, with \({K}_{\max }(p| n)=8.6\). By contrast, the inputs of S2 have a much higher complexity, \({K}_{\max }(p| n)=21.4\), because they need to be constrained less to produce this structure. This example illustrates how system-specific details of the RNA map can bias it away from certain simple outputs.

Similar examples of system-specific constraints for the FST and the perceptron can be found in the SI. We hypothesise that such low complexity, low probability structures highlight specific non-universal aspects of the maps, and that extra information (in the form of a reduced set of inputs) is needed to generate such structures.

Discussion

In conclusion, it is striking that bounds based simply on the complexity of the inputs and outputs can make powerful and general predictions for such a wide range of systems. Although the arguments used to derive them suffer from the well-known problems – e.g. the presence of uncomputable Kolmogorov complexities and unknown \({\mathcal{O}}(1)\) terms – that have led to the general neglect of AIT in the physics literature, the bounds are undoubtedly successful. It appears that, just as is found in other areas of physics, these relationships hold well outside of the asymptotic regime where they can be proven to be correct. This practical success opens up the promise of using such AIT-based techniques to derive other results for computable maps from across physics.

Many new questions arise. Can it be proven when the \({\mathcal{O}}(1)\) terms are relatively unimportant? Why do our rather simple approximations to K(x) work so well? It would be interesting to find maps for which these classical objections to the practical use of AIT are important. There may also be connections between our work and finite state complexity22 or minimum description length23 approaches. Progress in these domains should generate new fundamental understanding of the physics of information.

Methods

RNA sequence to secondary structure mapping

RNA is made of a linear sequence of 4 different kinds of nucleotides, so that there are NI = 4^L possible sequences for any particular length L. A versatile molecule, it can store information, as in messenger RNA, or else perform catalytic or structural functions. For functional RNA, the three-dimensional (3D) structure plays an important role in its function. In spite of decades of research, it remains difficult to reliably predict the 3D structure from the sequence alone. However, there are fast and accurate algorithms to calculate the so-called secondary structure (SS), which determines which base binds to which. Given a sequence, these methods typically minimize the Turner model24 for the free energy of a particular bonding pattern. The main contributions in the Turner model are the hydrogen bonding and stacking interactions between the nucleotides, as well as some entropic factors that take into account motifs such as loops. Fast algorithms based on dynamic programming allow for rapid calculation of these SS, and so this mapping from sequences to SS has been a popular model for many studies in biophysics.

In this context, we view it as an input-output map, from NI input sequences to NO output SS structures. This map has been extensively studied (see e.g.25,26,27,28,29,30,31,32,33) and provided profound insights into the biophysics of folding and evolution.

Here we use the popular Vienna package26 to fold sequences into structures, with all parameters set to their default values (e.g. the temperature T = 37°C). We folded all NI = 4^15 ≈ 10^9 sequences of length 15, obtaining 346 different structures, which were the minimum free-energy structures for those sequences. The number of sequences mapping to a structure is often called the neutral set size.
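A sketch of this exhaustive fold, assuming the ViennaRNA Python bindings (import RNA) are available, is given below; folding all 4^15 sequences is computationally heavy, so in practice the loop is parallelised or run at shorter lengths.

```python
# Sketch only (assumes the ViennaRNA Python bindings are installed); the full
# 4^15 enumeration is heavy, so shorter test lengths are advisable at first.
from itertools import product
from collections import Counter
import RNA

L = 15
neutral_set_sizes = Counter()
for seq_tuple in product('AUGC', repeat=L):
    seq = ''.join(seq_tuple)
    ss, mfe = RNA.fold(seq)            # minimum free-energy secondary structure
    neutral_set_sizes[ss] += 1

print(f"{len(neutral_set_sizes)} distinct structures")
P = {ss: c / 4 ** L for ss, c in neutral_set_sizes.items()}   # Eq. (2)
```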

The structures can be abstracted in standard dot-bracket notation, where brackets denote bonded bases and dots denote unbonded bases. For example, ...((....)).... means that the first three bases are not bonded, the fourth and fifth are bonded, the sixth through ninth are unbonded, the tenth base is bonded to the fifth base, the eleventh base is bonded to the fourth base, and the final four bases are unbonded.

To estimate the complexity of an RNA SS, we first converted the dot-bracket representation of the structure into a binary string x, and then used the complexity estimator described below to estimate its complexity. To convert to binary strings, we replaced each dot with the bits 00, each left-bracket with the bits 10, and each right-bracket with 01. Thus an RNA SS of length n becomes a bit-string of length 2n. As an example, the following n = 15 structure yields the displayed 30-bit string

$$..(((...)))....\to 000010101000000001010100000000$$
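A direct implementation of this conversion (a minimal sketch):

```python
def dotbracket_to_bits(ss):
    """Map '.' -> 00, '(' -> 10, ')' -> 01, giving a 2n-bit string."""
    code = {'.': '00', '(': '10', ')': '01'}
    return ''.join(code[c] for c in ss)

# Reproduces the n = 15 example above
assert dotbracket_to_bits('..(((...)))....') == (
    '0000' '101010' '000000' '010101' '00000000')
```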

Because we are interested in exhaustive calculations, we are limited to rather small RNA sequence lengths. This means that finite-size effects may play an important role. In ref. 13, we compared the simplicity bias bound (1) from the main text to longer sequences, where only partial sampling can be achieved, and showed that much clearer simplicity bias is evident in those systems.

Finite state transducer

Finite state transducers (FSTs) are a generalization of finite state machines that produce an output. They are defined by a finite set of states \({\mathcal{S}}\), finite input and output alphabets \({\mathcal{I}}\) and \({\mathcal{O}}\), and a transition function \(T:{\mathcal{S}}\times {\mathcal{I}}\to {\mathcal{S}}\times {\mathcal{O}}\) defining, for each state and input symbol, a next state and output symbol. One also needs to define a distinguished state \({S}_{0}\in {\mathcal{S}}\), which will be the initial state, before any input symbol has been read. Given an input sequence of L input symbols, the system visits different states, and simultaneously produces an output sequence of L output symbols.

FSTs form a popular toy system for computable maps. They can express any computable function that requires only a finite amount of memory, and the number of states in the FST offers a good parameter with which to control the complexity of the map. The class of machines described above is also known as Mealy machines34. If one instead restricts the output to depend only on the current state, one obtains Moore machines35. If one considers the input sequence to a Moore machine to be stochastic, it immediately follows that its state sequence follows a Markov chain, and its output sequence is a Markov information source. Therefore, FSTs can be used to model many stochastic systems in nature and engineering which can be described by finite-state Markov dynamics.

FSTs lie in the lowest class in the Chomsky hierarchy. However, they appear to be biased towards simple outputs in a manner similar to Levin’s coding theorem. In particular, Zenil et al.14 show evidence of this by correlating the probabilities with which FSTs and UTMs produce particular outputs. More precisely, they sampled random FSTs with random inputs, and random UTMs with random inputs, and then compared the empirical frequencies with which individual output strings are obtained by the two families, after many samples of machines and inputs. For both types of machines, simple strings were much more likely to be produced than complex strings.

We use randomly generated FSTs with 5 states. The FSTs are generated by uniformly sampling complete initially connected DFAs (where every state is reachable from the initial state, and the transition function is defined for every input) using the library FAdo36, which uses the algorithm developed by Almeida et al.37. Output symbols are then added to each transition independently and with uniform probability. In our experiments, the inputs and outputs are binary strings of length L = 30. The outputs for the whole set of 2^L input strings are computed using the HFST library (https://hfst.github.io/). Not all FSTs show bias, but we have observed that all those that show bias show simplicity bias, and have the same behaviour as that shown in Fig. 1 for low complexity, low probability outputs.
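A simplified sketch of this construction is given below (hypothetical code: it uses naive rejection sampling of initially connected complete DFAs rather than FAdo's implementation of the Almeida et al. sampler, and a hand-rolled runner rather than HFST):

```python
# Simplified sketch (not the paper's pipeline): build a random 5-state Mealy
# machine over a binary alphabet and run it on an input string.
import random

def random_mealy_machine(n_states=5, alphabet=('0', '1'), seed=None):
    rng = random.Random(seed)
    while True:
        # Complete transition function: (state, symbol) -> (next state, output)
        T = {(s, a): (rng.randrange(n_states), rng.choice(alphabet))
             for s in range(n_states) for a in alphabet}
        # Keep only machines where every state is reachable from state 0
        reached, frontier = {0}, [0]
        while frontier:
            s = frontier.pop()
            for a in alphabet:
                t = T[(s, a)][0]
                if t not in reached:
                    reached.add(t)
                    frontier.append(t)
        if len(reached) == n_states:
            return T

def run_fst(T, input_string, start_state=0):
    state, out = start_state, []
    for a in input_string:
        state, symbol = T[(state, a)]
        out.append(symbol)
    return ''.join(out)

T = random_mealy_machine(seed=1)
print(run_fst(T, '011010' * 5))   # length-30 output for a length-30 input
```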

We can see why some simple outputs will occur with low probability by considering system-specific details of the FST. For an FST, an output of length n which is n/2 zeros followed by n/2 ones is clearly simple, but we find that it has a low probability. We can understand this intuitively as follows. Producing such a string requires "counting" up to n/2 to know when to switch output, and counting requires a memory that grows with n, while FSTs have finite memory. We can also prove that, for instance, an FST that produces only such strings (for all n) is impossible. The set of possible strings that an FST can produce forms a regular language, obtained by using the output symbols at each transition as input symbols, which gives a non-deterministic finite automaton. Finally, using the pumping lemma38, it is easy to see that this family of strings is not a regular language.

Perceptron

The perceptron18 is the simplest type of artificial neural network. It consists of a single linear layer and a single output neuron with binary activation. Because modern deep neural network architectures are typically made of many layers of perceptrons, this simple system is important to study17. In this paper we use perceptrons with Boolean inputs and discretised weights. For inputs x ∈ {0, 1}^n, the discretised perceptron uses the following parametrised class of functions:

$${f}_{w,b}(x)={\bf{1}}(w\cdot x+b\, > \,0),$$

where \(w\in {\{-a,-a+\delta ,\ldots ,a-\delta ,a\}}^{n}\) and \(b\in \{-a,-a+\delta ,\ldots ,a-\delta ,a\}\) are the weight vector and bias term, which take values on a discrete lattice with D := 2a/δ + 1 possible values per weight. We used D = 2^k, so that each weight can be represented by k bits, and a = (2^k − 1)/2, so that δ = 1. Note that rescaling all the weights w and the bias b by the same fixed constant would not change the family of functions.

To obtain the results in Fig. 1, for which n = 7, we represented the weights and bias with k = 3 bits. We exhaustively enumerated all 2^{3(7+1)} possible values of the weights and bias, and we counted how many times we obtained each possible Boolean function on the Boolean hypercube {0, 1}^7. The weight-bias pair was represented using 3 × (7 + 1) = 24 bits. A pair (w, b) is an input to the parameter-function map of the perceptron. The complexity of inputs to this map can therefore be approximated by computing the Lempel-Ziv complexity of the 24-bit representation of the pair (w, b).
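A sketch of this enumeration (not the authors' code) is shown below; for k = 3 and n = 7 there are 2^24 weight-bias combinations, which is slow in pure Python, so a smaller n is used for illustration. The threshold w · x + b > 0 is assumed for the binary activation.

```python
# Sketch: enumerate all discretised weight-bias pairs and count how often each
# Boolean function on the hypercube is produced (smaller n for speed).
from itertools import product
from collections import Counter
import numpy as np

def enumerate_perceptron_functions(n=4, k=3):
    a = (2 ** k - 1) / 2
    lattice = np.arange(-a, a + 1)                        # {-a, -a+1, ..., a}, 2^k values
    inputs = np.array(list(product([0, 1], repeat=n)))    # Boolean hypercube
    counts = Counter()
    for wb in product(lattice, repeat=n + 1):
        w, b = np.array(wb[:n]), wb[n]
        f = tuple(int(v) for v in (inputs @ w + b > 0))   # truth table of f_{w,b}
        counts[f] += 1
    return counts

counts = enumerate_perceptron_functions()
total = sum(counts.values())
print(f"{len(counts)} distinct Boolean functions; "
      f"P(constant-0 function) = {counts[tuple([0] * 16)] / total:.3f}")
```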

In Fig. 4, we compare the simplicity bias of a perceptron with real-valued weights and bias, sampled from a standard Normal distribution, to the simplicity bias of the perceptron with discretized weights. We observe that both display similar simplicity bias, although the profile of the upper bound changes slightly.

Figure 4

Methods: Probability versus complexity \(\widetilde{K}(x)\) (measured here as CLZ(x) from Eq. (11)) shows simplicity bias in the perceptron for (a) full continuous weights and (b) discretised weights (as in Fig. 1 of the main text). Since weights and biases are real-valued in (a), it is not straightforward to measure the complexity of the inputs. It is, of course, possible to do so for the discretised weights of (b).

For the perceptron we can also understand some simple examples of low complexity, low probability outputs. For example, the function that outputs 0 everywhere except for a 1 at the inputs (1, 0, 0, 0, 0, 0, 0) and (0, 1, 0, 0, 0, 0, 0) has a similar complexity to the function that only outputs 1 at the inputs (1, 0, 0, 0, 0, 0, 0) and (0, 1, 1, 1, 1, 1, 1). However, the latter has a much lower probability. One can understand this because, if we take the dot product of a random weight vector w with two different inputs x1 and x2, the results have a correlation given by x1 ⋅ x2∕(||x1|| ||x2||). Therefore we expect the output at the input (0, 1, 1, 1, 1, 1, 1) to be correlated with the outputs at more of the other inputs than is the case for (0, 1, 0, 0, 0, 0, 0), so that the probability of it taking a different value from the majority of inputs (as required by the second function) is significantly lower.
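A quick numerical check of this correlation argument (an illustrative sketch, not from the paper):

```python
# Average correlation x.y/(|x||y|) of a given input x with the other nonzero
# Boolean inputs y; the dense input is far more strongly correlated on average.
import numpy as np
from itertools import product

def mean_corr_with_hypercube(x, n=7):
    xv = np.array(x, dtype=float)
    corrs = [xv @ np.array(y, dtype=float)
             / (np.linalg.norm(xv) * np.linalg.norm(y))
             for y in product([0, 1], repeat=n) if any(y) and y != tuple(x)]
    return float(np.mean(corrs))

print(mean_corr_with_hypercube((0, 1, 0, 0, 0, 0, 0)))   # approx 0.26: weakly correlated
print(mean_corr_with_hypercube((0, 1, 1, 1, 1, 1, 1)))   # approx 0.64: strongly correlated
```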

Methods to estimate complexity \(\widetilde{K}(x)\)

There is a much more extensive discussion of different ways to estimate the Kolmogorov complexity in the supplementary information of refs. 13 and 16. Here we use compression-based measures; as in these previous papers, these are based on the 1976 Lempel-Ziv (LZ) algorithm39, but with some small changes:

$${C}_{LZ}(x)=\left\{\begin{array}{ll}{{\rm{\log }}}_{2}(n), & x={0}^{n}\ {\rm{or}}\ {1}^{n}\\ {{\rm{\log }}}_{2}(n)[{N}_{w}({x}_{1}\,...{x}_{n})+{N}_{w}({x}_{n}\,...{x}_{1})]/2, & \,{\rm{otherwise}}\,\end{array}\right.$$
(11)

Here Nw(x) is the number of code words found by the LZ algorithm. The reason for distinguishing 0^n and 1^n is merely an artefact of Nw(x), which assigns complexity K = 1 to the string 0 or 1, but complexity 2 to 0^n or 1^n for n ≥ 2, whereas the Kolmogorov complexity of such a trivial string actually scales as \({{\rm{\log }}}_{2}(n)\), since one only needs to encode n. In this way we ensure that our CLZ(x) measure not only gives the correct behaviour for complex strings in the limit n → ∞, but also the correct behaviour for the simplest strings. In addition to the \({{\rm{\log }}}_{2}(n)\) correction, taking the mean of the complexity of the forward and reversed strings makes the measure more fine-grained, since it allows more values for the complexity of a string. Note that CLZ(x) can also be used for strings with larger alphabets than just the 0/1 binary alphabet.
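A minimal Python sketch of this estimator is given below; it uses the standard Kaspar-Schuster implementation of the 1976 LZ phrase count for Nw(x), and details may differ slightly from the authors' own implementation.

```python
# Sketch of Eq. (11): C_LZ(x) from the 1976 Lempel-Ziv phrase count N_w.
from math import log2

def lz76_phrase_count(s):
    """Number of phrases N_w in the 1976 Lempel-Ziv factorisation
    (Kaspar-Schuster algorithm)."""
    n = len(s)
    if n <= 1:
        return n
    i, k, l, k_max, c = 0, 1, 1, 1, 1
    while True:
        if s[i + k - 1] == s[l + k - 1]:
            k += 1
            if l + k > n:
                c += 1
                break
        else:
            if k > k_max:
                k_max = k
            i += 1
            if i == l:
                c += 1
                l += k_max
                if l + 1 > n:
                    break
                i, k, k_max = 0, 1, 1
            else:
                k = 1
    return c

def c_lz(x):
    """C_LZ(x) of Eq. (11): log2(n) for trivial strings, otherwise the
    symmetrised phrase count scaled by log2(n)."""
    n = len(x)
    if x == x[0] * n:                     # x = 0^n or 1^n
        return log2(n)
    return log2(n) * (lz76_phrase_count(x) + lz76_phrase_count(x[::-1])) / 2

print(c_lz('0' * 30))                             # log2(30) for a trivial string
print(c_lz('011010001101011100110111100011'))     # larger for a complex string
```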

To directly test the input-based measures we typically need fairly small systems, where the LZ-based measure above may show some anomalies (see also the supplementary information of ref. 13 for a more detailed description). Thus, for such small systems, or when comparing different types and sizes of objects (e.g. RNA SS and RNA sequences), a slightly different scaling may be more appropriate, which accounts both for the fact that CLZ(x) can exceed n for strings of length n, and for the fact that the lower complexity limit may not be ≈0, which it should be (see also the discussion in the supplementary information of ref. 13). Hence we use a different rescaling of the complexity measure

$$\widetilde{K}(x)={{\rm{\log }}}_{2}({N}_{O})\cdot \frac{{C}_{LZ}(x)-\min ({C}_{LZ}(x))}{\max ({C}_{LZ}(x))-\min ({C}_{LZ}(x))}$$
(12)

which will now range between \(0\le \widetilde{K}(x)\ \le \ {{\rm{\log }}}_{2}({N}_{O})=n\) if, for example, NO = 2^n. For large objects, this different scaling will reduce to the simpler one, because \(\max ({C}_{LZ}(x))\gg \min ({C}_{LZ}(x))\).
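A sketch of this rescaling (Eq. (12)), applied to a precomputed dictionary of raw CLZ values:

```python
# Sketch of Eq. (12): rescale raw C_LZ values onto the range [0, log2(N_O)].
from math import log2

def rescale_complexities(raw, n_outputs):
    """raw: dict {output: C_LZ(output)}; n_outputs: N_O, the number of outputs."""
    lo, hi = min(raw.values()), max(raw.values())
    return {x: log2(n_outputs) * (c - lo) / (hi - lo) for x, c in raw.items()}
```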

We note that there is nothing fundamental about using LZ to generate approximations to true Kolmogorov complexity. Many other approximations could be used, and their merits may depend on the details of the problems involved. For further discussion of other complexity measures, see for example the supplementary information of refs. 13,16.