## Introduction

A key aspect of human intelligence is our ability to build theories about the world. This faculty is most clearly manifested in the historical development of science1 but also occurs in miniature in everyday cognition2 and during childhood development3. The similarities between the process of developing scientific theories and the way that children construct an understanding of the world around them have led to the child-as-scientist metaphor in developmental psychology, which views conceptual changes during development as a form of scientific theory discovery4,5. Thus, a key goal for both artificial intelligence and computational cognitive science is to develop methods to understand—and perhaps even automate—the process of theory discovery6,7,8,9,10,11,12,13.

In this paper, we study the problem of AI-driven theory discovery, using human language as a testbed. We primarily focus on the linguist’s construction of language-specific theories, and the linguist’s synthesis of abstract cross-language meta-theories, but we also propose connections to child language acquisition. The cognitive sciences of language have long drawn an explicit analogy between the working scientist constructing grammars of particular languages and the child learning their languages14,15. Language-specific grammar must be formulated within a common theoretical framework, sometimes called universal grammar. For the linguist, this framework is the target of empirical inquiry; for the child, it includes those linguistic resources that they bring to the table for language acquisition.

Natural language is an ideal domain to study theory discovery for several reasons. First, on a practical level, decades of work in linguistics, psycholinguistics, and other cognitive sciences of language provide diverse raw material to develop and test models of automated theory discovery. There exist corpora, data sets, and grammars from a large variety of typologically distinct languages, giving a rich and varied testbed for benchmarking theory induction algorithms. Second, children easily acquire language from quantities of data that are modest by the standards of modern artificial intelligence16,17,18. Similarly, working field linguists also develop grammars based on very small amounts of elicited data. These facts suggest that the child-as-linguist analogy is a productive one and that inducing theories of language is tractable from sparse data with the right inductive biases. Third, theories of language representation and learning are formulated in computational terms, exposing a suite of formalisms ready to be deployed by AI researchers. These three features of human language—the availability of a large number of highly diverse empirical targets, the interfaces with cognitive development, and the computational formalisms within linguistics—conspire to single out language as an especially suitable target for research in automated theory induction.

Ultimately, the goal of the language sciences is to understand the general representations, processes, and mechanisms that allow people to learn and use language, not merely to catalog and describe particular languages. To capture this framework-level aspect of the problem of theory induction, we adopt the paradigm of Bayesian Program Learning (BPL: see ref. 19). A BPL model of an inductive inference problem, such as theory and grammar induction, works by inferring a generative procedure represented as a symbolic program. Conditioned on the output of that program, the model uses Bayes’ rule to work backward from data (program outputs) to the procedure that generated it (a program). We embed classic linguistic formalisms within a programming language provided to a BPL learner. Only with this inductive bias can a BPL model then learn programs capturing a wide diversity of natural language phenomena. By systematically varying this inductive bias, we can study elements of the induction problem that span multiple languages. By doing hierarchical Bayesian inference on the programming language itself, we can also automatically discover some of these universal trends. But BPL comes at a steep computational cost, and so we develop new BPL algorithms which combine techniques from program synthesis with intuitions drawn from how scientists build theories and how children learn languages.

We focus on theories of natural language morpho-phonology, the domain of language governing the interaction of word formation and sound structure. For example, the English plurals for dogs, horses, and cats are pronounced /dagz/, /hɔrsәz/, and /kæts/, respectively (we follow the convention of writing phoneme sequences between slashes). Making sense of these data involves realizing that the plural suffix is actually /z/ (part of English morphology), but that this suffix transforms depending on the sounds in the stem (English phonology). The suffix becomes /әz/ for horses (/hɔrsәz/) and other words ending in stridents such as /s/ or /z/; otherwise, the suffix becomes /s/ for cats (/kæts/) and other words ending in unvoiced consonants. Full English morphophonology explains other phenomena such as syllable stress and verb inflections. Figure 1a–c shows similar phenomena in Serbo-Croatian: just as English morphology builds the plural by adding /z/, Serbo-Croatian builds feminine forms by adding /a/. Just as English phonology inserts /ә/ near the end of /hɔrsәz/, Serbo-Croatian modifies a stem such as /yasn/ by inserting /a/ to get /yasan/. Discovering a language’s morphophonology means inferring its stems, prefixes, and suffixes (its morphemes), and also the phonological rules that predict how concatenations of these morphemes are actually pronounced. Thus acquiring the morpho-phonology of a language involves solving a basic problem confronting both linguists and children: to build theories of the relationships between form and meaning given a collection of utterances, together with aspects of their meanings.
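The English plural analysis above can be sketched in a few lines of Python. This is a minimal illustration, not the model evaluated in this paper: the phoneme classes `STRIDENTS` and `VOICELESS`, the function names, and the one-character-per-phoneme encoding are all assumptions made for the example.

```python
# Toy sketch of English plural morpho-phonology (illustrative assumptions only).
SCHWA = "ә"
STRIDENTS = {"s", "z", "ʃ", "ʒ"}                 # hypothetical, partial class
VOICELESS = {"p", "t", "k", "f", "s", "ʃ", "θ"}  # hypothetical, partial class


def epenthesis(phonemes):
    """Insert a schwa between a stem-final strident and word-final /z/."""
    if len(phonemes) >= 2 and phonemes[-1] == "z" and phonemes[-2] in STRIDENTS:
        return phonemes[:-1] + [SCHWA, "z"]
    return phonemes


def devoicing(phonemes):
    """Turn word-final /z/ into /s/ after an unvoiced consonant."""
    if len(phonemes) >= 2 and phonemes[-1] == "z" and phonemes[-2] in VOICELESS:
        return phonemes[:-1] + ["s"]
    return phonemes


def pluralize(stem):
    """Morphology appends the suffix /z/; phonology then adjusts the result."""
    underlying = list(stem) + ["z"]
    return "".join(devoicing(epenthesis(underlying)))


for stem in ["dag", "hɔrs", "kæt"]:
    print(stem, "->", pluralize(stem))
```

Rule order matters in this sketch: if devoicing ran first, the /z/ after /hɔrs/ would already have become /s/, and epenthesis would no longer find a word-final /z/ to separate from the stem.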

We evaluate our BPL approach on 70 data sets spanning the morphophonology of 58 languages. These data sets come from phonology textbooks: they have high linguistic diversity, but are much simpler than full language learning, with tens to hundreds of words at most, and typically isolate just a handful of grammatical phenomena. We will then shift our focus from linguists to children, and show that the same approach for finding grammatical structure in natural language also captures classic findings in the infant artificial grammar learning literature. Finally, by performing hierarchical Bayesian inference across these linguistic data sets, we show that the model can distill universal cross-language patterns, and express those patterns in a compact, human understandable form. Collectively, these findings point the way toward more human-like AI systems for learning theories, and for systems that learn to learn those theories more effectively over time by refining their inductive biases.

## Results

One central problem of natural language learning is to acquire a grammar that describes some of the relationships between form (perception, articulation, etc.) and meaning (concepts, intentions, thoughts, etc.; Supplementary Discussion 1). We think of grammars as generating form-meaning pairs, 〈f, m〉, where each form corresponds to a sequence of phonemes and each meaning is a set of meaning features. For example, in English, the word opened has the form/meaning $$\left\langle /{{{\rm{op}}}}{\upvarepsilon}{{{\rm{nd}}}}/,\,[{{{{{\bf{stem}}}}}}:{{{{{\rm{OPEN}}}}}};{{{{{\bf{tense}}}}}}:{{{{{\rm{PAST}}}}}}]\right\rangle$$, which the grammar builds from the form/meaning for open, namely $$\left\langle /{{{\rm{op}}}}{\upvarepsilon}{{{\rm{n}}}}/,\,[{{{{{\bf{stem}}}}}}:{{{{{\rm{OPEN}}}}}}]\right\rangle$$, and the past-tense form/meaning, namely $$\left\langle /{{{{{\rm{d}}}}}}/,[{{{{{\bf{tense}}}}}}:{{{{{\rm{PAST}}}}}}]\right\rangle$$. Such form-meaning pairs (stems, prefixes, suffixes) live in a part of the grammar called the lexicon (Fig. 1c). Together, morpho-phonology explains how word pronunciation varies systematically across inflections, and allows the speaker of a language to hear just a single example of a new word and immediately generate and comprehend all its inflected forms.

### Model

Our model explains a set X of form-meaning pairs 〈f, m〉 by inferring a theory (grammatical rules) T and lexicon L. For now, we consider maximum a posteriori (MAP) inference, which estimates a single 〈T, L〉, but later consider Bayesian uncertainty estimates over 〈T, L〉 and hierarchical modeling. This MAP inference seeks to maximize $$P({\bf{T}},{\bf{L}}|{\rm{UG}})\prod_{\langle f,m\rangle \in {\bf{X}}}P(f,m|{\bf{T}},{\bf{L}})$$, where UG (for universal grammar) encapsulates higher-level abstract knowledge across different languages. We decompose each language-specific theory into separate modules for morphology and for phonology (Fig. 2). We handle inflectional classes (e.g. declensions) by exposing this information in the observed meanings, which follows the standard textbook problem structure but simplifies the full problem faced by children learning the language. In principle, our framing could be extended to learn these classes by introducing an extra latent variable for each stem corresponding to its inflectional class. We also restrict ourselves to concatenative morphology, which builds words by concatenating stems, prefixes, and suffixes. Nonconcatenative morphologies20 (such as Tagalog’s reduplication, which copies syllables) are not handled. We assume that each morpheme is paired with a morphological category: either a prefix (pfx), suffix (sfx), or stem. We model the lexicon as a function from pairs of meanings and morphological categories to phonological forms. We model phonology as K ordered rules, written $${\left\{{r}_{k}\right\}}_{k=1}^{K}$$, each of which is a function mapping sequences of phonemes to sequences of phonemes. Given these definitions, we express the theory-induction objective as:

$$\begin{aligned}&\arg \max_{{\bf{T}},{\bf{L}}}\;P({\bf{T}},{\bf{L}}|{\rm{UG}})\prod_{\langle f,m\rangle \in {\bf{X}}}{\mathbb{1}}\left[f={\rm{Phonology}}({\rm{Morphology}}(m))\right]\\ &{\rm{where}}\;\;{\rm{Morphology}}([{\bf{stem}}\!:\sigma ;\,i])={\bf{L}}(i,{\mathtt{pfx}})\cdot {\bf{L}}(\sigma ,{\mathtt{stem}})\cdot {\bf{L}}(i,{\mathtt{sfx}})&&\text{(concatenate prefix, stem, suffix)}\\ &\phantom{{\rm{where}}\;\;}{\rm{Phonology}}(m)={r}_{1}({r}_{2}(\cdots {r}_{K}(m)\cdots ))&&\text{(apply ordered rewrite rules)}\end{aligned}$$
(1)

where [stem: σ; i] is a meaning with stem σ, and i are the remaining aspects of meaning that exclude the stem (e.g., i could be [tense:PAST; gender:FEMALE]). The expression $${\mathbb{1}}\left[\cdot \right]$$ equals 1 if its argument is true and 0 otherwise. In words, Eq. (1) seeks the highest probability theory that exactly reproduces the data, like classic MDL learners21. This equation forces the model to explain every word in terms of rules operating over concatenations of morphemes, and does not allow wholesale memorization of words in the lexicon. Eq. (1) assumes fusional morphology: every distinct combination of inflections fuses into a new prefix/suffix. This fusional assumption can emulate arbitrary concatenative morphology: although each inflection seems to have a single prefix/suffix, the lexicon can implicitly cache concatenations of morphemes. For instance, if the morpheme marking tense precedes the morpheme marking gender, then L([tense:PAST; gender:FEMALE], pfx) could equal L([tense:PAST], pfx) · L([gender:FEMALE], pfx). We use a description-length prior for P(T, L | UG) favoring compact lexica and fewer, less complex rules (Supplementary Methods 3.4).
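The hard constraint inside Eq. (1) can be made concrete with a small Python sketch. It is a simplified illustration under stated assumptions: the lexicon is a plain dictionary keyed by (meaning, category), inflections are opaque labels such as `"PAST"`, and the rule `devoice_final_d`, the lexical entries, and the data are hypothetical examples rather than anything learned by the model.

```python
# Sketch of Eq. (1)'s hard constraint; names and data are illustrative assumptions.

def morphology(lexicon, stem, inflection):
    """Concatenate prefix, stem, and suffix; a missing affix is the empty string."""
    prefix = lexicon.get((inflection, "pfx"), "")
    suffix = lexicon.get((inflection, "sfx"), "")
    return prefix + lexicon[(stem, "stem")] + suffix


def phonology(rules, underlying):
    """Compute r_1(r_2(...r_K(m))): r_K applies first and r_1 applies last."""
    for rule in reversed(rules):
        underlying = rule(underlying)
    return underlying


def explains(lexicon, rules, data):
    """Product of indicators in Eq. (1): True only if every form is reproduced."""
    return all(
        form == phonology(rules, morphology(lexicon, stem, inflection))
        for form, stem, inflection in data
    )


def devoice_final_d(word):
    """Hypothetical rule: word-final /d/ becomes /t/ after an unvoiced obstruent."""
    if len(word) >= 2 and word[-1] == "d" and word[-2] in "ptkfs":
        return word[:-1] + "t"
    return word


LEXICON = {("WALK", "stem"): "wɔk", ("OPEN", "stem"): "opɛn", ("PAST", "sfx"): "d"}
DATA = [("wɔk", "WALK", "BARE"), ("wɔkt", "WALK", "PAST"), ("opɛnd", "OPEN", "PAST")]

print(explains(LEXICON, [devoice_final_d], DATA))
print(explains(LEXICON, [], DATA))
```

With the devoicing rule the toy theory reproduces all three forms. Without it the constraint fails, because the concatenation /wɔkd/ does not match the observed /wɔkt/.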

The data X typically come from a paradigm matrix, whose columns range over inflections and whose rows range over stems (Supplementary Methods 3.1). In this setting, an equivalent Bayesian framing (“Methods”) permits probabilistic scoring of new stems by treating the rules and affixes as a generative model over paradigm rows.

### Representing rules and sounds

Phonemes (atomic sounds) are represented as vectors of binary features. For example, one such feature is nasal, for which, e.g., /m/ and /n/ are +nasal. Phonological rules operate over this feature space. To represent the space of such rules we adopt the classical formulation in terms of context-dependent rewrites22. These are sometimes called SPE-style rules since they were used extensively in the Sound Pattern of English22. Rules are written (focus) → (structural change)/(left trigger)_(right trigger), meaning that the focus phoneme(s) are transformed according to the structural change whenever the left/right triggering environments occur immediately to the left/right of the focus (Supplementary Fig. 5). Triggering environments specify conjunctions of features (characterizing sets of phonemes sometimes called natural classes). For example, in English, phonemes which are [−sonorant] (such as /d/) become [−voice] (e.g., /d/ becomes /t/) at the end of a word (written #) whenever the phoneme to the left is an unvoiced nonsonorant ([−voice −sonorant], such as /k/), written [−sonorant] → [−voice]/[−voice −sonorant]_#. This specific rule transforms the past tense walked from /wɔkd/ into its pronounced form /wɔkt/. The subscript 0 denotes zero or more repetitions of a feature matrix, called the “Kleene star” operator (i.e., [+voice]₀ means zero or more repetitions of [+voice] phonemes). When such rules are restricted so that they cannot apply cyclically to their own output, the rules and morphology correspond to 2-way rational functions, which in turn correspond to finite-state transducers23. It has been argued that the space of finite-state transductions has sufficient representational power to cover known empirical phenomena in morpho-phonology and represents a limit on the descriptive power actually used by phonological theories, even those that are formally more powerful, including Optimality Theory24.
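The rule format described above can be illustrated with a small interpreter. The sketch below is an assumption-laden toy, not the representation used by our system: the five-feature inventory and its feature values are invented for the example, the focus is limited to a single phoneme, each trigger is limited to a single adjacent position, and Kleene star is not supported.

```python
# Minimal SPE-style rule interpreter over a toy feature system (all values assumed).
from dataclasses import dataclass
from typing import Dict, Optional, Union

FEATURE_NAMES = ("sonorant", "voice", "coronal", "dorsal", "syllabic")
INVENTORY = {  # one "+"/"-" per feature, in FEATURE_NAMES order
    "t": "--+--", "d": "-++--", "k": "---+-", "g": "-+-+-",
    "n": "+++--", "w": "++-+-", "ɔ": "++-++",
}
SYMBOL = {vector: phoneme for phoneme, vector in INVENTORY.items()}
Matrix = Dict[str, str]


@dataclass
class Rule:
    focus: Matrix
    change: Matrix
    left: Optional[Union[Matrix, str]] = None   # feature matrix, "#", or None
    right: Optional[Union[Matrix, str]] = None


def features(phoneme):
    """Look up a phoneme's feature values as a name -> sign dictionary."""
    return dict(zip(FEATURE_NAMES, INVENTORY[phoneme]))


def matches(phoneme, matrix):
    """True if the phoneme carries every feature value in the matrix."""
    values = features(phoneme)
    return all(values[name] == sign for name, sign in matrix.items())


def trigger_ok(word, index, trigger):
    """Check one triggering environment; "#" matches only a word boundary."""
    if trigger is None:
        return True
    inside = 0 <= index < len(word)
    if trigger == "#":
        return not inside
    return inside and matches(word[index], trigger)


def apply_rule(rule, word):
    """Rewrite every focus phoneme whose context matches, reading the input word."""
    out = list(word)
    for i, phoneme in enumerate(word):
        if (matches(phoneme, rule.focus)
                and trigger_ok(word, i - 1, rule.left)
                and trigger_ok(word, i + 1, rule.right)):
            updated = {**features(phoneme), **rule.change}
            vector = "".join(updated[name] for name in FEATURE_NAMES)
            # Leave the phoneme unchanged if the result is outside the toy inventory.
            out[i] = SYMBOL.get(vector, phoneme)
    return "".join(out)


# [-sonorant] -> [-voice] / [-voice -sonorant] _ #
DEVOICE = Rule(focus={"sonorant": "-"}, change={"voice": "-"},
               left={"voice": "-", "sonorant": "-"}, right="#")
print(apply_rule(DEVOICE, "wɔkd"))
```

Because the rule is stated over features rather than individual symbols, the same `DEVOICE` object would apply to any word-final obstruent in the inventory, which is the kind of generalization across phonemes that feature-based rules are meant to capture.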

To learn such grammars, we adopt the approach of Bayesian Program Learning (BPL). In this setting, we model each T as a program in a programming language that captures domain-specific constraints on the problem space. The linguistic architecture common to all languages is often referred to as universal grammar. Our approach can be seen as a modern instantiation of a long-standing approach in linguistics that adopts human-understandable generative representations to formalize universal grammar22.

### Inference

We have defined the problem a BPL theory inductor needs to solve, but have not given any guidance on how to solve it. In particular, the space of all programs is infinitely large and lacks the local smoothness exploited by local optimization algorithms like gradient descent or Markov Chain Monte Carlo. We adopt a strategy based on constraint-based program synthesis, where the optimization problem is translated into a combinatorial constraint satisfaction problem and solved using a Boolean Satisfiability (SAT) solver25. These solvers implement an exhaustive but relatively efficient search and guarantee that, given enough time, an optimal solution will be found. We use the Sketch26 program synthesizer, which can solve for the smallest grammar consistent with some data, subject to an upper bound on the grammar size (see “Methods”).

In practice, the clever exhaustive search techniques employed by SAT solvers fail to scale to the many rules needed to explain large corpora. To scale these solvers to large and complex theories, we take inspiration from a basic feature of how children acquire language and how scientists build theories. Children do not learn a language in one fell swoop, instead progressing through intermediate stages of linguistic development, gradually enriching their mastery of both grammar and lexicon. Similarly, a sophisticated scientific theory might start with a simple conceptual kernel, and then gradually grow to encompass more and more phenomena. Motivated by these observations, we engineered a program synthesis algorithm that starts with a small program, and then repeatedly uses a SAT solver to search for small modifications that allow it to explain more and more data. Concretely, we find a counterexample to our current theory, and then use the solver to exhaustively explore the space of all small modifications to the theory which can accommodate this counterexample. This combines ideas from counter-example guided inductive synthesis26 (which alternates synthesis with a verifier that feeds new counterexamples to the synthesizer) with test-driven synthesis27 (which synthesizes new conditional branches for each such counterexample); it also exposes opportunities for parallelism (Supplementary Methods 3.3). Figure 3 illustrates this incremental, solver-aided synthesis algorithm, while Supplementary Methods 3.3 gives a concrete walk-through of the first few iterations.
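The counterexample-driven loop described above can be sketched as follows. This is a deliberately small stand-in, with several assumptions: brute-force enumeration of single edits replaces the SAT solver, underlying forms are given rather than inferred, the candidate rule pool `CANDIDATES` is fixed by hand, and rules apply in tuple order (first element first).

```python
# Toy counterexample-driven incremental synthesis. Brute-force enumeration of
# one-step edits stands in for the SAT solver; underlying forms are assumed known.
SCHWA = "ә"


def epenthesis(word):
    """Insert a schwa between a final strident and word-final /z/."""
    if len(word) >= 2 and word[-1] == "z" and word[-2] in "szʃʒ":
        return word[:-1] + SCHWA + "z"
    return word


def devoicing(word):
    """Turn word-final /z/ into /s/ after an unvoiced consonant."""
    if len(word) >= 2 and word[-1] == "z" and word[-2] in "ptkfsʃθ":
        return word[:-1] + "s"
    return word


CANDIDATES = {"epenthesis": epenthesis, "devoicing": devoicing}


def run(theory, underlying):
    """Apply the named rules in tuple order (first element applies first)."""
    for name in theory:
        underlying = CANDIDATES[name](underlying)
    return underlying


def neighbors(theory):
    """All theories one deletion, substitution, or insertion away."""
    for i in range(len(theory)):
        yield theory[:i] + theory[i + 1:]
    for i in range(len(theory)):
        for name in CANDIDATES:
            if name != theory[i]:
                yield theory[:i] + (name,) + theory[i + 1:]
    for i in range(len(theory) + 1):
        for name in CANDIDATES:
            yield theory[:i] + (name,) + theory[i:]


def incremental_synthesis(data):
    """Grow a theory by repairing one counterexample at a time."""
    theory, covered = (), []
    while True:
        counterexample = next(
            (pair for pair in data if run(theory, pair[0]) != pair[1]), None)
        if counterexample is None:
            return theory          # every form is explained
        covered.append(counterexample)
        for candidate in neighbors(theory):
            if all(run(candidate, u) == s for u, s in covered):
                theory = candidate
                break
        else:
            return None            # stuck: no single edit accommodates the data


DATA = [("dagz", "dagz"), ("kætz", "kæts"), ("hɔrsz", "hɔrs" + SCHWA + "z")]
print(incremental_synthesis(DATA))
```

On this toy data the loop first adds devoicing to handle /kæts/, then inserts epenthesis ahead of it to handle the strident-final stem. Returning `None` when no single edit works mirrors the way a local search of this kind can get stuck.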

This heuristic approach lacks the completeness guarantee of SAT solving: it does not provably find an optimal solution, despite repeatedly invoking a complete, exact SAT solver. However, each such repeated invocation is much more tractable than direct optimization over the entirety of the data. This is because constraining each new theory to be close in theory-space to its preceding theory leads to polynomially smaller constraint satisfaction problems and therefore exponentially faster search times, because SAT solvers scale, in the worst case, exponentially with problem size.

### Quantitative analysis

We apply our model to 70 problems from linguistics textbooks28,29,30. Each textbook problem requires synthesizing a theory of a number of forms drawn from some natural language. These problems span a wide range of difficulties and cover a diverse set of natural language phenomena. This includes tonal languages, for example, in Kerewe, to count is /kubala/, but to count it is /kukíbála/, where accents mark high tones; languages with vowel harmony, for example Turkish has /el/, /tan/ meaning hand, bell, respectively, and /el-ler/, /tan-lar/ for the plurals hands, bells, respectively (dashes inserted at suffix boundaries for clarity); and many other linguistic phenomena such as assimilation and epenthesis (Fig. 4 and Supplementary Figs. 13).

We first measure the model’s ability to discover the correct lexicon. Compared to ground-truth lexica, our model finds grammars correctly matching the entirety of the problem’s lexicon for 60% of the benchmarks, and correctly explains the majority of the lexicon for 79% of the problems (Fig. 5a). Typically, the correct lexicon for each problem is less ambiguous than the correct rules, and any rules which generate the full data from the correct lexicon must be observationally equivalent to any ground truth rules we might posit. Thus, agreement with ground-truth lexica should act as a proxy for whether the synthesized rules have the correct behavior on the data, which should correlate with rule quality. To test this hypothesis we randomly sample 15 problems and grade the discovered rules, in consultation with a professional linguist (the second author). We measure both recall (the fraction of actual phonological rules correctly recovered) and precision (the fraction of recovered rules which actually occur). Rule accuracy, under both precision and recall, positively correlates with lexicon accuracy (Fig. 5c): when the system gets all the lexicon correct, it rarely introduces extraneous rules (high precision), and virtually always gets all the correct rules (high recall).
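For readers who want the two rule-level metrics spelled out, the following sketch computes them from sets of rule labels. It assumes rules can be compared by exact label equality, which is a simplification: the grading reported here was done by hand in consultation with a linguist, and the rule names below are hypothetical.

```python
# Precision and recall over rule sets; exact label matching is an assumption.

def precision_recall(recovered, gold):
    """Return (precision, recall) for recovered rules against gold rules."""
    recovered, gold = set(recovered), set(gold)
    correct = recovered & gold
    precision = len(correct) / len(recovered) if recovered else 1.0
    recall = len(correct) / len(gold) if gold else 1.0
    return precision, recall


recovered_rules = {"final_devoicing", "schwa_epenthesis", "spurious_deletion"}
gold_rules = {"final_devoicing", "schwa_epenthesis", "vowel_harmony", "nasal_assimilation"}
print(precision_recall(recovered_rules, gold_rules))
```

In this made-up case two of three recovered rules are correct and two of four gold rules are found.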

Prior approaches to morphophonological process learning either abandon theory induction by learning black-box probabilistic models31, or induce interpretable models but do not scale to a wide range of challenging and realistic data sets. These interpretable alternatives include unsupervised distributional learners, such as the MDL genetic algorithm in Rasin et al.32, which learns from raw word frequencies. Other interpretable models leverage strong supervision: Albright et al.33 learns rules from input–outputs, while ref. 34 learns finite state transducers in the same setting. Other works attain strong theoretical learning guarantees by restricting the class of rules: e.g., ref. 35 considers 2-input strictly local functions. These interpretable approaches typically consider 2–3 simple rules at most. In contrast, Goldwater et al.34 scales to tens of rules on thousands of words by restricting itself to non-interacting local orthographic rules.

Our results hinge on several factors. A key ingredient is a correct set of constraints on the space of hypotheses, i.e. a universal grammar. We can systematically vary this factor: switching from phonological articulatory features to simpler acoustic features degrades performance (simple features in Fig. 5a, b). Our simpler acoustic features come from the first half of a standard phonology text28, while the articulatory features come from the latter half, so this comparison loosely models a contrast between novice and expert phonology students (Supplementary Methods 3.5). We can further remove two essential sources of representational power: Kleene star, which allows arbitrarily long-range dependencies, and phonological features, which allow analogizing and generalizing across phonemes. Removing these renders only the simplest problems solvable (-representation in Fig. 5a, b). Basic algorithmic details also matter. Building a large theory at once is harder for human learners, and also for our model (CEGIS in Fig. 5a, b). The recent SyPhon36 algorithm strikes a different and important point on the accuracy/coverage tradeoff: it aims to solve problems in seconds or minutes so that linguists can interactively use it. In contrast, our system’s average solution time is 3.6 h (Fig. 5b). SyPhon’s speed comes from strong independence assumptions between lexica and individual rules, and from disallowing non-local rules. These assumptions degrade coverage: SyPhon fails to solve 76% of our data set. We hope that their work and ours set the stage for future systems that run interactively while also more fully modeling the richness and diversity of human language.

### Child language generalization

If our model captures aspects of linguistic analysis from naturalistic data, and assuming linguists and children confront similar problems, then our approach should extend to model at least some aspects of the child’s linguistic generalization. Studying children’s (and adults’) learning of carefully constructed artificial grammars has a long tradition in psycholinguistics and language acquisition37,38,39, because it permits controlled and careful study of the generalization of language-like patterns. We present our model with the artificial stimuli used in a number of artificial grammar learning (AGL) experiments38,39,40 (Fig. 6a), systematically varying the quantity of data given to the model (Fig. 6b). The model demonstrates few-shot inference of the same language patterns probed in classic infant studies of AGL.

These AGL stimuli contain very little data, and thus these few-shot learning problems admit a broad range of possible generalizations. Children choose the linguistically plausible ones from this space of possible generalizations. Thus, rather than producing a single grammar, we use the model to search a massive space of possible grammars and then visualize all those grammars that are Pareto-optimal solutions41 to the trade-off between parsimony and fit to data. Here parsimony means size of rules and affixes (the prior in Eq. (10)); fit to data means average stem size (the likelihood in Eq. (10)); and a Pareto-optimal solution is one for which no other solution is at least as good along both of these competing axes and strictly better along one. Figure 7 visualizes Pareto fronts for two classic artificial grammars while varying the number of example words provided to the learner, illustrating both the set of grammars entertained by the learner and how the learner weighs these grammars against each other. These figures show the exact contours of the Pareto frontier: these problems are small enough that exact SAT solving is tractable over the entire search space, so our heuristic incremental synthesizer is unneeded. With more examples the shape of the Pareto frontier develops a sharp kink around the correct generalization; with fewer examples, the frontier is smoother and more diffuse. By explaining both natural language data and AGL studies, we see our model as delivering on a basic hypothesis underpinning AGL research: that artificial grammar learning must engage some cognitive resource shared with first language acquisition. To the extent that this hypothesis holds, we should expect an overlap between models capable of learning real linguistic phenomena, like ours, and models of AGL phenomena.
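The notion of Pareto optimality used here can be stated compactly in code. The sketch below treats both axes as costs to be minimized (size of rules and affixes, and average stem size); the candidate grammars and their cost values are invented purely for illustration.

```python
# Pareto front over (rule/affix size, average stem size); lower is better on both.
# The candidate grammars and their costs are made-up illustrative numbers.

def pareto_front(candidates):
    """Keep candidates that no other candidate dominates."""
    front = []
    for name, size, stem in candidates:
        dominated = any(
            s <= size and t <= stem and (s < size or t < stem)
            for _, s, t in candidates
        )
        if not dominated:
            front.append(name)
    return front


GRAMMARS = [
    ("memorize_everything", 0, 12),   # no rules, long stems
    ("one_rule", 4, 6),
    ("bloated_rule", 9, 6),           # same fit as one_rule, but larger
    ("copy_rule", 10, 2),             # bigger rules, very short stems
]
print(pareto_front(GRAMMARS))
```

Here `bloated_rule` drops out because `one_rule` achieves the same stem size with smaller rules, while the other three remain as different trade-offs between parsimony and fit.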

### Synthesizing higher-level theoretical knowledge

No theory is built from scratch: Instead, researchers borrow concepts from existing frameworks, make analogies with other successful theories, and adapt general principles to specific cases. Through analysis and modeling of many different languages, phonologists (and linguists more generally) develop overarching meta-models that restrict and bias the space of allowed grammars. They also develop the phonological common sense that allows them to infer grammars from sparse data, knowing which rule systems are plausible based on their prior knowledge of human language, and which systems are implausible or simply unattested. For example, many languages devoice word-final obstruents, but almost no language voices word-final obstruents (cf. Lezgian42). This cross-theory common-sense is found in other sciences. For example, physicists know which potential energy functions tend to occur in practice (radially symmetric, pairwise, etc.). Thus a key objective for our work is the automatic discovery of a cross-language metamodel capable of imparting phonological common sense.

Conceptually, this meta-theorizing corresponds to estimating a prior, M, over language-specific theories, and performing hierarchical Bayesian inference across many languages. Concretely, we think of the meta-theory M as being a set of schematic, highly reusable phonological-rule templates, encoded as a probabilistic grammar over the structure of phonological rules, and we will estimate both the structure and the parameters of this grammar jointly with the solutions to textbook phonology problems. To formalize a set of meta-theories and define a prior over that set, we use the Fragment Grammars formalism43, a probabilistic grammar learning setup that caches and reuses fragments of commonly used rule subparts. Assuming we have a collection of D data sets (e.g., from different languages), notated $$\{{{{{\bf{X}}}}}^{d}\}_{d=1}^{D}$$, our model constructs D grammars, $${\{\langle {{{{{{\bf{T}}}}}}}^{d},{{{{{{\bf{L}}}}}}}^{d}\rangle \}}_{d=1}^{D}$$, along with a meta-theory M, seeking to maximize

$$P({{{{{\bf{M}}}}}})\mathop{\prod }\limits_{d=1}^{D}P({{{{{{\bf{T}}}}}}}^{d},{{{{{{\bf{L}}}}}}}^{d}|{{{{{\bf{M}}}}}})P({{{{{{\bf{X}}}}}}}^{d}|{{{{{{\bf{T}}}}}}}^{d},{{{{{{\bf{L}}}}}}}^{d})$$
(2)

where P(M) is a prior on fragment grammars over SPE-style rules. In practice, jointly optimizing over the space of Ms and grammars is intractable, and so we instead alternate between finding high-probability grammars under our current M, and then shifting our inductive bias, M, to more closely match the current grammars. We estimate M by applying this procedure to a training subset comprising 30 problems, chosen to exemplify a range of distinct phenomena, and then apply this M to all 70 problems. Critically, this unsupervised procedure is not given access to any ground-truth solutions to the training subset.
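One way to read this alternation is as approximate coordinate ascent on Eq. (2). The updates below are an idealized sketch under that reading, not a statement of the exact procedure: in practice the first step is carried out only approximately by the incremental synthesizer, and the second step may shift M toward the current grammars rather than fully maximize. The primed symbol M′ is introduced here only as the variable of optimization.

$$\begin{aligned}\langle {\bf{T}}^{d},{\bf{L}}^{d}\rangle &\leftarrow \arg \max_{{\bf{T}},{\bf{L}}}\;P({\bf{T}},{\bf{L}}|{\bf{M}})\,P({\bf{X}}^{d}|{\bf{T}},{\bf{L}})\quad \text{for } d=1,\ldots ,D\\ {\bf{M}}&\leftarrow \arg \max_{{\bf{M}}^{\prime}}\;P({\bf{M}}^{\prime})\prod_{d=1}^{D}P({\bf{T}}^{d},{\bf{L}}^{d}|{\bf{M}}^{\prime})\end{aligned}$$

The data likelihood does not appear in the second update because, in Eq. (2), it does not depend on M once the grammars are held fixed.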

This machine-discovered higher-level knowledge serves two functions. First, it is a form of human understandable knowledge: manually inspecting the contents of the fragment grammar reveals cross-language motifs previously discovered by linguists (Fig. 8c). Second, it can be critical to actually getting these problems correct (Fig. 8a, b and middle column of Fig. 8c). This occurs because a better inductive bias steers the incremental synthesizer toward more promising avenues, which decreases its chances of getting stuck in a neighborhood of the search space where no incremental modification offers improvement.

To be clear, our mechanized meta-theorizing is not an attempt to learn universal grammar (cf. ref. 44). Rather than capture a learning process, our meta-theorizing is analogous to a discovery process that distills knowledge of typological tendencies, thereby aiding future model synthesis. However, we believe that children possess implicit knowledge of these and other tendencies, which contributes to their skills as language learners. Similarly, we believe the linguist’s skill in analysis draws on an explicit understanding of these and other cross-linguistic trends.

## Discussion

Our high-level goal was to engineer methods for synthesizing interpretable theories, using morphophonology as a testbed and linguistic analysis as inspiration. Our results give a working demonstration that it is possible to automatically discover human-understandable knowledge about the structure of natural language. As is true for linguists, optimal inference hinges on higher-level biases and constraints; but the toolkit developed here permits systematic probing of these abstract assumptions and data-driven discovery of cross-language trends. Our work speaks to a long-standing analogy between the problems confronting children and linguists, and computationally cashes out the basic assumptions that underpin infant and child studies of artificial grammar learning.

Within phonology, our work offers a computational tool that can be used to study different grammatical hypotheses: mapping and scoring analyses under different objective functions, and studying the implications of different inductive biases and representations across a suite of languages. This toolkit can spur quantitative studies of classic phonological problems, such as probing extensionally-equivalent analyses (e.g., distinguishing deletion from epenthesis).

More broadly, the tools and approaches developed here suggest routes for machines that learn the causal structure of the world, while representing their knowledge in a format that can be reused and communicated to other agents, both natural and artificial. While this goal remains far off, it is worth taking stock of where this work leaves us on the path toward a theory induction machine: what are the prospects for scaling an approach like ours to other domains of language, or other domains of science more broadly? Scaling to the full linguistic hierarchy (acoustics, phonotactics, syntax, semantics, pragmatics) requires more powerful programming languages for expressing symbolic rules, and more scalable inference procedures: although the textbook problems we solve are harder than those tackled by prior work, full morpho-phonology remains larger and more intricate than the problems considered here. More fundamentally, however, we advocate for hybrid neuro-symbolic models45,46,47 to capture crisp systematic productivity alongside more graded linguistic generalizations, such as that embodied by distributional models of language structure48.

Scaling to real scientific discovery demands fundamental innovations, but holds promise. Unlike language acquisition, genuinely new scientific theories are hard-won, developing over timescales that can span a decade or more. They involve the development of new formal substrates and new vocabularies of concepts, such as force in physics and allele in biology. We suggest three lines of attack. Drawing inspiration from conceptual role semantics49, future automated theory builders could introduce and define new theoretical objects in terms of their interrelations to other elements of the theory’s conceptual repertoire, only at the end grounding out in testable predictions. Drawing on the findings of our work here, the most promising domains are those which are solvable, in some version, by both child learners and adult scientists. This means first investigating sciences with counterparts in intuitive theories, such as classical mechanics (and intuitive physics), or cognitive science (and folk psychology). Building on the findings here and in ref. 11, a crucial element of theory induction will be the joint solving of many interrelated model building problems, followed by the synthesis of abstract over-hypotheses that encapsulate the core theoretical principles while simultaneously accelerating future induction through shared statistical strength.

Theory induction is a grand challenge for AI, and our work here captures only small slices of the theory-building process. Like our model, human theorists do craft models by examining experimental data, but they also propose new theories by unifying existing theoretical frameworks, performing thought experiments, and inventing new formalisms. Humans also deploy their theories more richly than our model: proposing new experiments to test theoretical predictions, engineering new tools based on the conclusions of a theory, and distilling higher-level knowledge that goes far beyond what our Fragment-Grammar approximation can represent. Continuing to push theory induction along these many dimensions remains a prime target for future research.

## Methods

### Program synthesis

We use the Sketch26 program synthesizer. Sketch can solve the following constrained optimization problem, which is equivalent to our goal of maximizing P(X|T, L)P(T, L|UG):

$$\begin{array}{ll}\mathrm{maximize}&F(\mathbf{X},\mathbf{T})=\sum\limits_{k=1}^{K}\log P(r_{k}|\mathrm{UG})+\sum\limits_{\langle f,c,m\rangle \in \mathbf{L}}\log P(f|\mathrm{UG})\\ \mathrm{subject}\,\mathrm{to}&C(\mathbf{X},\mathbf{T})=\forall \,\langle f,\,[\mathbf{stem}\!:\sigma ;\,i]\rangle \in \mathbf{X}:\\ &f=r_{1}\left(\cdots r_{K}(\mathbf{L}(\langle i,\mathtt{pfx}\rangle )\cdot \mathbf{L}(\langle \sigma,\mathtt{stem}\rangle )\cdot \mathbf{L}(\langle i,\mathtt{sfx}\rangle ))\cdots \,\right)\end{array}$$
(3)

given observations X and bound on the number of rules K.

Sketch offers an exhaustive search strategy, but we use incremental solving in order to scale to large grammars. Mathematically this works as follows: we iteratively construct a sequence of theories $${\bf{T}}_{0},{\bf{T}}_{1},\ldots$$ alongside successively larger data sets $${\bf{X}}_{0},{\bf{X}}_{1},\ldots$$ converging to the full data set X, such that the t-th theory $${\bf{T}}_{t}$$ explains data set $${\bf{X}}_{t}$$, and successive theories are close to one another as measured by edit distance:

$${{{{{{\bf{X}}}}}}}_{t+1}={{{{{{\bf{X}}}}}}}_{t}\cup ({{{{{\rm{a}}}}}}\,{{{{{\rm{set}}}}}}\,{{{{{\bf{X}}}}}}^{\prime} \subseteq \,{{{{{\bf{X}}}}}}\,{{{{{\rm{where}}}}}}\,\neg \,C({{{{{{\bf{T}}}}}}}_{t},{{{{{\bf{X}}}}}}^{\prime} ))$$
(4)
$${D}_{t+1}=\mathop{\min }\limits_{D}D,\, {{{{{\rm{such}}}}}}\,{{{{{\rm{that}}}}}}: {{{{\exists}}}}\,{{{{{\bf{T}}}}}}\,{{{{{\rm{where}}}}}}\,C({{{{{\bf{T}}}}}},{{{{{{\bf{X}}}}}}}_{t+1})\,\,{{{{{\rm{and}}}}}}\,d({{{{{\bf{T}}}}}},{{{{{{\bf{T}}}}}}}_{t})\,\le \,D$$
(5)
$${{{{{{\bf{T}}}}}}}_{t+1}=\arg \mathop{\max }\limits_{{{{{{\bf{T}}}}}}}F({{{{{{\bf{X}}}}}}}_{t+1},{{{{{\bf{T}}}}}}),{{{{{\rm{such}}}}}}\,{{{{{\rm{that}}}}}}:{{{{{\bf{T}}}}}}\,{{{{{\rm{satisfies}}}}}}\,C({{{{{\bf{T}}}}}},{{{{{{\bf{X}}}}}}}_{t+1})\,{{{{{\rm{and}}}}}}\,d({{{{{\bf{T}}}}}},{{{{{{\bf{T}}}}}}}_{t})\, \le\,{D}_{t+1}$$
(6)

where d(  ,  ) measures edit distance between theories, $${D}_{t+1}$$ is the edit distance between the theories at iterations t + 1 and t, and we use the t = 0 base cases $${{{{{{\bf{X}}}}}}}_{0}=\varnothing$$ and $${\bf{T}}_{0}$$, an empty theory containing no rules. We “minibatch” counterexamples to the current theory ($${{{{{\bf{X}}}}}}^{\prime}$$ in Eq. (4)), grouped by lexeme and ordered by their occurrence in the data (e.g., if the theory fails to explain walk/walks/walked, and this is the next example in the data, then the surface forms of walk/walks/walked will be added to $${\bf{X}}_{t+1}$$). See Supplementary Methods 3.3.
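The loop in Eqs. (4)–(6) can be illustrated with a deliberately tiny stand-in for the synthesizer. The sketch below is not the Sketch-based implementation: it assumes a hypothetical fixed pool of string-rewrite rules (`RULE_POOL`), takes underlying forms as given rather than inferring a lexicon, scores a theory simply by its number of rules, and adds one counterexample at a time instead of a lexeme-sized minibatch. It is written for Python 3.

```python
from itertools import combinations

# Hypothetical toy search space: a "theory" is a frozenset of rewrite rules
# drawn from this pool. This stands in for the space searched by Sketch.
RULE_POOL = [("z", "s"), ("d", "t"), ("b", "p"), ("g", "k")]


def apply_theory(theory, underlying):
    """Apply the rules of a theory, in pool order, to an underlying form."""
    form = underlying
    for old, new in RULE_POOL:
        if (old, new) in theory:
            form = form.replace(old, new)
    return form


def explains(theory, data):
    """Toy version of the constraint C: every pair is derived exactly."""
    return all(apply_theory(theory, u) == s for u, s in data)


def score(theory):
    """Toy stand-in for the objective F: fewer rules is better."""
    return -len(theory)


def neighbours(theory, radius):
    """All theories whose edit distance to `theory` is exactly `radius`,
    where edit distance counts rules added or removed."""
    for flips in combinations(RULE_POOL, radius):
        yield theory ^ frozenset(flips)


def incremental_solve(data):
    theory, seen = frozenset(), []
    while True:
        # Eq. (4): grow the working data set with the next counterexample.
        failing = [ex for ex in data if not explains(theory, [ex])]
        if not failing:
            return theory
        seen.append(failing[0])
        # Eq. (5): find the smallest radius admitting a consistent theory.
        for radius in range(len(RULE_POOL) + 1):
            candidates = [t for t in neighbours(theory, radius)
                          if explains(t, seen)]
            if candidates:
                # Eq. (6): no closer theory is consistent, so the best theory
                # within the radius lies at exactly this distance.
                theory = max(candidates, key=score)
                break
        else:
            raise ValueError("no theory in the pool explains the data")


data = [("dog", "tok"), ("cab", "cap"), ("cat", "cat")]
print(sorted(incremental_solve(data)))  # [('b', 'p'), ('d', 't'), ('g', 'k')]
```

On this toy data the first counterexample forces two rules at once (edit distance 2), while the second is repaired by a single added rule (edit distance 1), which is the behaviour the edit-distance bound is meant to encourage.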

We implement all models as Python 2.7 scripts that invoke Sketch 1.7.5, and also use Python 2.7 for all data analysis.

### Allophony problems

Allophony problems comprise the observed form-meaning set X, as well as a substitution, which is a partial function mapping phonemes to phonemes (see Supplementary Methods 3.1). This mapping operates over phonemes called ‘allophones.’ The goal of the model is to recover the rule(s) that predict which element of each allophone pair is an underlying form, and which is merely an allophone. The underlying phonemes are allowed in the lexicon, while the other allophones are not allowed in the lexicon and surface only due to phonological rules. For example, an allophony substitution could be $$\left\{b\,\mapsto\, p,d\,\mapsto\, t,g\,\mapsto\, k\right\}$$. We extend such substitutions to total functions on phoneme sequences by applying the substitution to phonemes in its domain, and leaving all other phonemes unchanged. We call this total function s(). For instance, using the previous example substitution, s(abkpg) = apkpk. Solving an allophony problem means finding rules that either map the domain of s() to its range (T1 below), or vice versa (T2 below):

$${{{{{{\bf{L}}}}}}}_{1}(m)=s(f)\;{{{{{\rm{when}}}}}}\,\exists \langle f,m\rangle \in {{{{{\bf{X}}}}}}\\ {{{{{{\bf{L}}}}}}}_{2}(m)={s}^{-1}(f)\,{{{{{\rm{when}}}}}}\,\exists \langle f,m\rangle \in {{{{{\bf{X}}}}}}\\ {{{{{\rm{For}}}}}}\,i\in \left\{1,2\right\}:\\ \quad{{{{{{\bf{T}}}}}}}_{i}=\mathop{{{\rm{arg max}}} }\limits_{{{{{{\bf{T}}}}}}}\,P({{{{{\bf{T}}}}}}|{{{{{\rm{UG}}}}}})P({{{{{\bf{X}}}}}}|{{{{{\bf{T}}}}}},{{{{{{\bf{L}}}}}}}_{i})\left(=\mathop{{{\rm{arg max}}}}\limits_{{{{{{\bf{T}}}}}}}P({{{{{\bf{X}}}}}},{{{{{\bf{T}}}}}},{{{{{{\bf{L}}}}}}}_{i}|{{{{{\rm{UG}}}}}})\right)$$
(7)
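As a concrete illustration of s() and of the two candidate lexica in Eq. (7), the following Python 3 sketch treats each character as one phoneme, which is a simplification made here for readability. The function names are hypothetical, and s⁻¹ is read as the substitution with its pairs reversed, which is our interpretation of the notation.

```python
def extend_substitution(substitution):
    """Turn a partial phoneme-to-phoneme map into a total map on sequences."""
    def s(form):
        # Phonemes outside the domain of the substitution pass through unchanged.
        return "".join(substitution.get(phoneme, phoneme) for phoneme in form)
    return s


def candidate_lexica(observations, substitution):
    """Build L1(m) = s(f) and L2(m) = s^{-1}(f) for each observed <f, m>."""
    s = extend_substitution(substitution)
    reversed_pairs = {target: source for source, target in substitution.items()}
    s_inv = extend_substitution(reversed_pairs)
    lexicon_1 = {meaning: s(form) for form, meaning in observations}
    lexicon_2 = {meaning: s_inv(form) for form, meaning in observations}
    return lexicon_1, lexicon_2


devoicing = {"b": "p", "d": "t", "g": "k"}
s = extend_substitution(devoicing)
print(s("abkpg"))  # apkpk, matching the example in the text
```

Each of the two lexica is then paired with its own rule search, and the two resulting theories are compared under the objective in Eq. (7).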

### Probabilistic framing

Our few-shot artificial grammar learning simulations require probabilistically scoring held-out unobserved words corresponding to unobserved stems. We now present a refactoring of our Bayesian learning setup that permits these calculations. Given rules T and lexicon L, we define a likelihood $${P}^{{\rm{Lik}}}$$ over a paradigm matrix X when the data X contain stems disjoint from those in L:

$${P}^{{{{{{\rm{Lik}}}}}}}({{{{{\bf{X}}}}}}|{{{{{\bf{T}}}}}},{{{{{\bf{L}}}}}})=\mathop{\sum}\limits_{{{{{{\bf{L}}}}}}^{\prime} }P({{{{{\bf{X}}}}}}|{{{{{\bf{T}}}}}},{{{{{\bf{L}}}}}}\cup {{{{{\bf{L}}}}}}^{\prime} )P({{{{{\bf{L}}}}}}^{\prime}|{{{{{\rm{UG}}}}}})$$
(8)

where $${{{{{\bf{L}}}}}}^{\prime}$$ ranges over lexica which assign forms to the stems present in X, i.e. $${{{{{\bf{L}}}}}}^{\prime} \ni \langle f^{\prime},{\mathtt{stem}},\sigma \rangle$$ iff $${\bf{X}}\ni \langle f,[{\bf{stem}}\!:\sigma ;\,i]\rangle$$ for some surface form f and some underlying form $$f^{\prime}$$. The term $${P}^{{\rm{Lik}}}$$ can be lower bounded by taking the most likely underlying form for each stem:

$${P}^{{{{{{\rm{Lik}}}}}}}({{{{{\bf{X}}}}}}|{{{{{\bf{T}}}}}},{{{{{\bf{L}}}}}})\,\ge \,\mathop{\max }\limits_{{{{{{\bf{L}}}}}}^{\prime} }P({{{{{\bf{X}}}}}}|{{{{{\bf{T}}}}}},{{{{{\bf{L}}}}}}\cup {{{{{\bf{L}}}}}}^{\prime} )P({{{{{\bf{L}}}}}}^{\prime}|{{{{{\rm{UG}}}}}})$$
(9)

This lower bound will be tightest when each paradigm row admits very few possible stems. Typically only one stem per row is consistent with the rules and affixes, which justifies this bound.
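The step from Eq. (8) to Eq. (9) is an instance of a general fact about sums of non-negative terms, and spelling it out shows when the bound is tight. The shorthand $$a_{{\bf{L}}^{\prime}}$$ for the summand of Eq. (8) is introduced here only for this remark:

$$a_{{\bf{L}}^{\prime}}=P({\bf{X}}|{\bf{T}},{\bf{L}}\cup {\bf{L}}^{\prime})P({\bf{L}}^{\prime}|{\rm{UG}})\ge 0\quad \Longrightarrow \quad \sum\limits_{{\bf{L}}^{\prime}}a_{{\bf{L}}^{\prime}}\ge \max\limits_{{\bf{L}}^{\prime}}a_{{\bf{L}}^{\prime}}$$

Equality holds exactly when at most one lexicon $${\bf{L}}^{\prime}$$ has a non-zero summand. If only one stem per row is consistent with the rules and affixes, only one $${\bf{L}}^{\prime}$$ contributes, so the bound is then essentially exact.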

The connection between the Bayesian likelihood $${P}^{{\rm{Lik}}}$$ and the MAP objective (Eq. (1)) can be seen by partitioning the lexicon into affixes (in L) and stems (in $${{{{{\bf{L}}}}}}^{\prime}$$), which also decomposes the objective into a parsimony-favoring prior and a fit-to-data favoring likelihood term:

$$\max\limits_{\mathbf{T},\mathbf{L}}P(\mathbf{T},\mathbf{L}|\mathrm{UG})\,P^{\mathrm{Lik}}(\mathbf{X}|\mathbf{T},\mathbf{L})\,\ge \,\max\limits_{\mathbf{T},\mathbf{L},\mathbf{L}^{\prime}}\underbrace{P(\mathbf{T},\mathbf{L}|\mathrm{UG})}_{\mathrm{prior}}\,\underbrace{P(\mathbf{L}^{\prime}|\mathrm{UG})\,P(\mathbf{X}|\mathbf{T},\mathbf{L}\cup \mathbf{L}^{\prime})}_{\mathrm{likelihood}}$$
(10)
$$=\max\limits_{\mathbf{T},\mathbf{L},\mathbf{L}^{\prime}}\underbrace{P(\mathbf{T},\mathbf{L}\cup \mathbf{L}^{\prime}|\mathrm{UG})\,P(\mathbf{X}|\mathbf{T},\mathbf{L}\cup \mathbf{L}^{\prime})}_{=\,\mathrm{Eq.}\,1\,\mathrm{w/}\,\mathrm{lexicon}\,\mathrm{set}\,\mathrm{to}\,\mathbf{L}\cup \mathbf{L}^{\prime}}$$
(11)

### Few-shot artificial grammar learning

We present our system with a training set $${\bf{X}}_{{\rm{train}}}$$ of words from a target language, such as the ABA language (e.g., /wofewo/, /mikami/, ...). We model this training set as a paradigm matrix with a single column (single inflection), with each word corresponding to a different stem (a different row in the matrix). Then we compute the likelihood assigned to a held-out word $${\bf{X}}_{{\rm{test}}}$$ either consistent with the target grammar (e.g., following the ABA pattern) or inconsistent with the target grammar (e.g., following the ABB pattern, such as /wofefe/, /mikaka/, ...). The probability assigned to a held-out test word, conditioned on the training set, is approximated by marginalizing over the Pareto-optimal grammars for the train set, rather than marginalizing over all possible grammars:

$$P({{{{{{\bf{X}}}}}}}_{{{{{{\rm{test}}}}}}}|{{{{{{\bf{X}}}}}}}_{{{{{{\rm{train}}}}}}}) =\mathop{\sum}\limits_{{{{{{\bf{T}}}}}},{{{{{\bf{L}}}}}}}P({{{{{\bf{T}}}}}},{{{{{\bf{L}}}}}}|{{{{{{\bf{X}}}}}}}_{{{{{{\rm{train}}}}}}}){P}^{{{{{{\rm{Lik}}}}}}}({{{{{{\bf{X}}}}}}}_{{{{{{\rm{test}}}}}}}|{{{{{\bf{T}}}}}},{{{{{\bf{L}}}}}})\hfill\\ \approx \mathop{\sum}\limits_{\langle {{{{{\bf{T}}}}}},{{{{{\bf{L}}}}}}\rangle \in {{{{{\rm{ParetoFrontier}}}}}}({{{{{{\bf{X}}}}}}}_{{{{{{\rm{train}}}}}}})}\frac{P({{{{{{\bf{X}}}}}}}_{{{{{{\rm{train}}}}}}},{{{{{\bf{T}}}}}},{{{{{\bf{L}}}}}}){P}^{{{{{{\rm{Lik}}}}}}}({{{{{{\bf{X}}}}}}}_{{{{{{\rm{test}}}}}}}|{{{{{\bf{T}}}}}},{{{{{\bf{L}}}}}})}{{\sum }_{\langle {{{{{\bf{T}}}}}}^{\prime},{{{{{\bf{L}}}}}}^{\prime} \rangle \in {{{{{\rm{ParetoFrontier}}}}}}({{{{{{\bf{X}}}}}}}_{{{{{{\rm{train}}}}}}})}P({{{{{{\bf{X}}}}}}}_{{{{{{\rm{train}}}}}}},{{{{{\bf{T}}}}}}^{\prime},{{{{{\bf{L}}}}}}^{\prime} )}$$
(12)

which relies on the fact that Sketch has out-of-the-box support for finding Pareto-optimal solutions to multiobjective optimization problems26. We approximate the likelihood $${P}^{{\rm{Lik}}}({\bf{X}}_{{\rm{test}}}|{\bf{T}},{\bf{L}})$$ using the lower bound in Eq. (9), equivalently finding the shortest stem which will generate the test word $${\bf{X}}_{{\rm{test}}}$$, given the affixes in L and the rules in T.
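The second line of Eq. (12) is a weighted average of test likelihoods, with weights given by each frontier grammar's joint probability with the training set. A minimal Python 3 sketch of that arithmetic follows. It assumes the per-grammar log scores have already been obtained from the synthesizer; the numbers in the example are invented purely for illustration and are not results from the paper.

```python
import math


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    peak = max(values)
    if peak == -math.inf:
        return -math.inf
    return peak + math.log(sum(math.exp(v - peak) for v in values))


def predictive_log_prob(frontier):
    """Approximate log P(X_test | X_train) as in the second line of Eq. (12).

    `frontier` holds one (log_joint_train, log_lik_test) pair per
    Pareto-optimal grammar <T, L>.
    """
    log_normalizer = log_sum_exp([joint for joint, _ in frontier])
    log_numerator = log_sum_exp([joint + lik for joint, lik in frontier])
    return log_numerator - log_normalizer


# Invented scores for two hypothetical frontier grammars: one with a copying
# rule (cannot generate ABB words at all) and one that memorizes syllables.
aba_word = [(math.log(0.006), math.log(0.5)), (math.log(0.002), math.log(0.1))]
abb_word = [(math.log(0.006), -math.inf), (math.log(0.002), math.log(0.1))]
print(predictive_log_prob(aba_word) > predictive_log_prob(abb_word))  # True
```

Working in log space avoids underflow, and a grammar that cannot generate the test word simply contributes a likelihood of zero.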

### Synthesizing a metatheory

At a high level, inference of the cross-language fragment grammar works by maximizing a variational-particle50 lower bound on the joint probability of the metatheory M and the D data sets, $${\{{{{{{{\bf{X}}}}}}}^{d}\}}_{d=1}^{D}$$:

$$\log P\left({{{{{\bf{M}}}}}},{\{{{{{{{\bf{X}}}}}}}^{d}\}}_{d=1}^{D}\right)\ge \log P({{{{{\bf{M}}}}}})+\mathop{\sum }\limits_{d=1}^{D}\log \mathop{\sum}\limits_{\begin{array}{c}\langle {{{{{{\bf{T}}}}}}}_{d},{{{{{{\bf{L}}}}}}}_{d}\rangle \in \\ {{{{{\rm{support}}}}}}[{Q}_{d}(\cdot )]\end{array}}P\left({{{{{{\bf{X}}}}}}}^{d}|{{{{{{\bf{T}}}}}}}_{d},{{{{{{\bf{L}}}}}}}_{d}\right)P\left({{{{{{\bf{T}}}}}}}_{d},{{{{{{\bf{L}}}}}}}_{d}|{{{{{\bf{M}}}}}}\right)$$
(13)

where this bound is written in terms of a set of variational approximate posteriors, $${\left\{{Q}_{d}\right\}}_{d=1}^{D}$$, whose support we constrain to be small, which ensures that the above objective is tractable. We alternate maximization with respect to M (i.e., inferring a fragment grammar from the theories in the supports of $${\left\{{Q}_{d}\right\}}_{d=1}^{D}$$), and maximization with respect to $${\left\{{Q}_{d}\right\}}_{d=1}^{D}$$ (i.e., finding a small set of theories for each data set that are likely under the current M). Our lower bound increases most when the support of each $${Q}_{d}$$ coincides with the top-k most likely theories, so at each round of optimization, we ask the program synthesizer to find the top k theories maximizing $$P({\bf{X}}^{d}|{\bf{T}}_{d},{\bf{L}}_{d})P({\bf{T}}_{d},{\bf{L}}_{d}|{\bf{M}})$$. In practice, we find the top k = 100 theories for each data set.
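To make the bound in Eq. (13) concrete, the Python 3 sketch below evaluates it for given particle sets and selects a top-k support for one data set. It assumes each candidate theory arrives as a pair of log scores, log P(Xᵈ|T, L) and log P(T, L|M); the function names are hypothetical, and producing the candidates themselves is the synthesizer's job and is not shown.

```python
import heapq
import math


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    peak = max(values)
    if peak == -math.inf:
        return -math.inf
    return peak + math.log(sum(math.exp(v - peak) for v in values))


def particle_bound(log_prior_m, supports):
    """Right-hand side of Eq. (13).

    `supports` has one entry per data set d; each entry is a list of
    (log_lik, log_prior_given_m) pairs for the theories in support[Q_d].
    """
    return log_prior_m + sum(
        log_sum_exp([lik + prior for lik, prior in support])
        for support in supports
    )


def top_k_support(candidates, k):
    """Keep the k candidates with the largest log_lik + log_prior_given_m."""
    return heapq.nlargest(k, candidates, key=lambda pair: pair[0] + pair[1])
```

Because each data set contributes a sum of non-negative terms, no other support of size k can give a larger bound than the k highest-scoring theories, which is why the search targets the top k.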

We represent M by adapting the Fragment Grammars formalism43. Concretely, M is a probabilistic context-free grammar (PCFG) that stochastically generates phonological rules. More precisely, M generates the syntax tree of a program which implements a phonological rule. In the Fragment Grammars formalism, one first defines a base grammar, which is a context-free grammar. Our base grammar is a context-free grammar over SPE rules (Supplementary Fig. 6). Learning the fragment grammar consists of adding new productions to this base grammar (the “fragments”), while also assigning probabilities to each production rule. Formally, each fragment is a subtree of a derivation of a tree generated from a non-terminal symbol in the base grammar; informally, each fragment is a template for a piece of a tree, and thus acts as a schema for a piece of a phonological rule. Learning a fragment grammar never changes the set of trees (i.e., programs and rules) that can be generated from the grammar. Instead, through a combination of estimating probabilities and defining new productions, it adjusts the probability of different trees. See Supplementary Fig. 6, which shows the symbolic structure of the learned fragment grammar.

This fragment grammar gives us a learned prior over single phonological rules. We define P(T, L|M) by assuming that rules are generated independently and that M does not affect the prior probability of L:

$$P({{{{{\bf{T}}}}}},{{{{{\bf{L}}}}}}|{{{{{\bf{M}}}}}})=P({{{{{\bf{L}}}}}}|{{{{{\rm{UG}}}}}})\mathop{\prod}\limits_{r\in {{{{{\bf{T}}}}}}}P(r|{{{{{\bf{M}}}}}})$$
(14)

Our prior over fragment grammars, P(M), follows the original work in this space43: fragments are generated sequentially, with each new fragment stochastically sampled from the current fragment grammar. This encourages shorter fragments, as well as reuse across fragments.

We depart from ref. 43 in our inference algorithm: while ref. 43 uses Markov Chain Monte Carlo methods to stochastically sample from the posterior over fragment grammars, we instead perform hillclimbing upon the objective in Eq. (13). Each round of hillclimbing proposes new fragments by antiunifying subtrees of phonological rules in $${\left\{{{{{{{\bf{T}}}}}}}_{d}\right\}}_{d=1}^{D}$$, and re-estimates the continuous parameters of the resulting PCFG using the classic Inside–Outside algorithm51. When running Inside–Outside we place a symmetric Dirichlet prior over the continuous parameters of the PCFG with pseudocounts equal to 1.
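Antiunification computes the most specific template shared by two trees, replacing each point of disagreement with a variable. The Python 3 sketch below shows the textbook first-order version on rule trees encoded as nested tuples of the form `(label, child, ...)`. The encoding and the example rules are assumptions made for illustration, and the variables here are untyped, whereas holes in a fragment grammar correspond to non-terminals of the base grammar.

```python
def antiunify(tree_a, tree_b, table=None):
    """Return the least general generalization of two trees.

    Trees are strings (leaves) or tuples (label, child_1, ..., child_n).
    Mismatched subtrees become variables "?0", "?1", ...; the same pair of
    mismatched subtrees always maps to the same variable.
    """
    if table is None:
        table = {}
    if tree_a == tree_b:
        return tree_a
    same_shape = (
        isinstance(tree_a, tuple) and isinstance(tree_b, tuple)
        and len(tree_a) == len(tree_b) and tree_a[0] == tree_b[0]
    )
    if same_shape:
        children = tuple(
            antiunify(a, b, table) for a, b in zip(tree_a[1:], tree_b[1:])
        )
        return (tree_a[0],) + children
    key = (tree_a, tree_b)
    if key not in table:
        table[key] = "?%d" % len(table)
    return table[key]


# Two hypothetical devoicing rules that differ only in their right context.
word_final = ("rule", ("matrix", "-sonorant"), ("matrix", "-voice"),
              ("ctx",), ("ctx", "#"))
before_voiceless = ("rule", ("matrix", "-sonorant"), ("matrix", "-voice"),
                    ("ctx",), ("ctx", ("matrix", "-voice")))
print(antiunify(word_final, before_voiceless))
```

The result keeps the shared focus and structural change and leaves a single hole in the right context, which is the kind of reusable rule schema a fragment is meant to capture.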

### Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.