A key aspect of human intelligence is our ability to build theories about the world. This faculty is most clearly manifested in the historical development of science1 but also occurs in miniature in everyday cognition2 and during childhood development3. The similarities between the process of developing scientific theories and the way that children construct an understanding of the world around them have led to the child-as-scientist metaphor in developmental psychology, which views conceptual changes during development as a form of scientific theory discovery4,5. Thus, a key goal for both artificial intelligence and computational cognitive science is to develop methods to understand—and perhaps even automate—the process of theory discovery6,7,8,9,10,11,12,13.

In this paper, we study the problem of AI-driven theory discovery, using human language as a testbed. We primarily focus on the linguist’s construction of language-specific theories, and the linguist’s synthesis of abstract cross-language meta-theories, but we also propose connections to child language acquisition. The cognitive sciences of language have long drawn an explicit analogy between the working scientist constructing grammars of particular languages and the child learning their languages14,15. Language-specific grammar must be formulated within a common theoretical framework, sometimes called universal grammar. For the linguist, this framework is the target of empirical inquiry; for the child, it includes those linguistic resources that they bring to the table for language acquisition.

Natural language is an ideal domain to study theory discovery for several reasons. First, on a practical level, decades of work in linguistics, psycholinguistics, and other cognitive sciences of language provide diverse raw material to develop and test models of automated theory discovery. There exist corpora, data sets, and grammars from a large variety of typologically distinct languages, giving a rich and varied testbed for benchmarking theory induction algorithms. Second, children easily acquire language from quantities of data that are modest by the standards of modern artificial intelligence16,17,18. Similarly, working field linguists also develop grammars based on very small amounts of elicited data. These facts suggest that the child-as-linguist analogy is a productive one and that inducing theories of language is tractable from sparse data with the right inductive biases. Third, theories of language representation and learning are formulated in computational terms, exposing a suite of formalisms ready to be deployed by AI researchers. These three features of human language—the availability of a large number of highly diverse empirical targets, the interfaces with cognitive development, and the computational formalisms within linguistics—conspire to single out language as an especially suitable target for research in automated theory induction.

Ultimately, the goal of the language sciences is to understand the general representations, processes, and mechanisms that allow people to learn and use language, not merely to catalog and describe particular languages. To capture this framework-level aspect of the problem of theory induction, we adopt the paradigm of Bayesian Program Learning (BPL: see ref. 19). A BPL model of an inductive inference problem, such as theory and grammar induction, works by inferring a generative procedure represented as a symbolic program. Conditioned on the output of that program, the model uses Bayes’ rule to work backward from data (program outputs) to the procedure that generated it (a program). We embed classic linguistic formalisms within a programming language provided to a BPL learner. Only with this inductive bias can a BPL model then learn programs capturing a wide diversity of natural language phenomena. By systematically varying this inductive bias, we can study elements of the induction problem that span multiple languages. By doing hierarchical Bayesian inference on the programming language itself, we can also automatically discover some of these universal trends. But BPL comes at a steep computational cost, and so we develop new BPL algorithms which combine techniques from program synthesis with intuitions drawn from how scientists build theories and how children learn languages.

We focus on theories of natural language morpho-phonology—the domain of language governing the interaction of word formation and sound structure. For example, the English plurals for dogs, horses, and cats are pronounced /dagz/, /hɔrsәz/, and /kæts/, respectively (plural suffixes underlined; we follow the convention of writing phoneme sequences between slashes). Making sense of this data involves realizing that the plural suffix is actually /z/ (part of English morphology), but this suffix transforms depending on the sounds in the stem (English phonology). The suffix becomes /әz/ for horses (/hɔrsәz/) and other words ending in stridents such as /s/ or /z/; otherwise, the suffix becomes /s/ for cats (/kæts/) and other words ending in unvoiced consonants. Full English morphophonology explains other phenomena such as syllable stress and verb inflections. Figure 1a–c shows similar phenomena in Serbo-Croatian: just as English morphology builds the plural by adding /z/, Serbo-Croatian builds feminine forms by adding /a/. Just as English phonology inserts /ә/ at the end of /hɔrsәz/, Serbo-Croatian modifies a stem such as /yasn/ by inserting /a/ to get /yasan/. Discovering a language’s morphophonology means inferring its stems, prefixes, and suffixes (its morphemes), and also the phonological rules that predict how concatenations of these morphemes are actually pronounced. Thus acquiring the morpho-phonology of a language involves solving a basic problem confronting both linguists and children: to build theories of the relationships between form and meaning given a collection of utterances, together with aspects of their meanings.
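The English plural example can be made concrete with a small sketch. The code below is illustrative only: the phoneme classes `STRIDENTS` and `VOICELESS` are hand-picked assumptions rather than a full feature system, and the function names are hypothetical. It shows morphology (appending /z/) followed by two ordered phonological rules.

```python
# Toy sketch: English plural = morphology (stem + /z/) followed by two ordered rules.
# The phoneme classes below are hand-picked assumptions, not a full feature system.
SCHWA = "\u0259"                 # IPA schwa
STRIDENTS = set("szʃʒ")
VOICELESS = set("ptkfsθʃ")


def epenthesis(word: str) -> str:
    """Insert a schwa between a strident and a word-final /z/."""
    if len(word) >= 2 and word[-1] == "z" and word[-2] in STRIDENTS:
        return word[:-1] + SCHWA + "z"
    return word


def devoicing(word: str) -> str:
    """Turn a word-final /z/ into /s/ after a voiceless consonant."""
    if len(word) >= 2 and word[-1] == "z" and word[-2] in VOICELESS:
        return word[:-1] + "s"
    return word


def plural(stem: str) -> str:
    underlying = stem + "z"                    # morphology: concatenate stem and suffix
    return devoicing(epenthesis(underlying))   # phonology: epenthesis first, then devoicing


for stem in ["dag", "hɔrs", "kæt"]:
    print(stem, "->", plural(stem))
```

The rule order matters in this sketch: if devoicing applied before epenthesis, /hɔrs/ + /z/ would surface with a final /s/ and the schwa would never be inserted.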

Fig. 1: A morpho-phonology problem.

a Serbo-Croatian data (simplified). This language’s morphology is illustrated for masculine and feminine forms. The data motivate a morphological rule which forms the feminine form by appending /a/. b illustrates a counterexample to this analysis: the masculine, feminine forms of clear are /yasan/, /yasna/. These pronunciations are explained by Serbo-Croatian phonology: the sound /a/ is inserted between pairs of consonants at the end of words, notated \(\varnothing \to \mathrm{a}\,/\,\mathrm{C}\,\_\,\mathrm{C}\#\). This rule requires that the true stem for /yasan/, /yasna/ is /yasn/. c shows further stems inferred for this data. These stems are stored in the lexicon.

We evaluate our BPL approach on 70 data sets spanning the morphophonology of 58 languages. These data sets come from phonology textbooks: they have high linguistic diversity, but are much simpler than full language learning, with tens to hundreds of words at most, and typically isolate just a handful of grammatical phenomena. We will then shift our focus from linguists to children, and show that the same approach for finding grammatical structure in natural language also captures classic findings in the infant artificial grammar learning literature. Finally, by performing hierarchical Bayesian inference across these linguistic data sets, we show that the model can distill universal cross-language patterns, and express those patterns in a compact, human understandable form. Collectively, these findings point the way toward more human-like AI systems for learning theories, and for systems that learn to learn those theories more effectively over time by refining their inductive biases.


One central problem of natural language learning is to acquire a grammar that describes some of the relationships between form (perception, articulation, etc.) and meaning (concepts, intentions, thoughts, etc.; Supplementary Discussion 1). We think of grammars as generating form-meaning pairs, 〈f, m〉, where each form corresponds to a sequence of phonemes and each meaning is a set of meaning features. For example, in English, the word opened has the form/meaning \(\langle /\mathrm{op}\upvarepsilon\mathrm{nd}/,\ [\mathbf{stem}{:}\,\mathrm{OPEN};\ \mathbf{tense}{:}\,\mathrm{PAST}]\rangle\), which the grammar builds from the form/meaning for open, namely \(\langle /\mathrm{op}\upvarepsilon\mathrm{n}/,\ [\mathbf{stem}{:}\,\mathrm{OPEN}]\rangle\), and the past-tense form/meaning, namely \(\langle /\mathrm{d}/,\ [\mathbf{tense}{:}\,\mathrm{PAST}]\rangle\). Such form-meaning pairs (stems, prefixes, suffixes) live in a part of the grammar called the lexicon (Fig. 1c). Together, morphology and phonology explain how word pronunciation varies systematically across inflections, and allow the speaker of a language to hear just a single example of a new word and immediately generate and comprehend all its inflected forms.


Our model explains a set X of form-meaning pairs 〈f, m〉 by inferring a theory (grammatical rules) T and lexicon L. For now, we consider maximum a posteriori (MAP) inference, which estimates a single 〈T, L〉, but later consider Bayesian uncertainty estimates over 〈T, L〉, and hierarchical modeling. This MAP inference seeks to maximize \(P(\mathbf{T},\mathbf{L}|\mathrm{UG})\prod_{\langle f,m\rangle \in \mathbf{X}}P(f,m|\mathbf{T},\mathbf{L})\), where UG (for universal grammar) encapsulates higher-level abstract knowledge across different languages. We decompose each language-specific theory into separate modules for morphology and for phonology (Fig. 2). We handle inflectional classes (e.g. declensions) by exposing this information in the observed meanings, which follows the standard textbook problem structure but simplifies the full problem faced by children learning the language. In principle, our framing could be extended to learn these classes by introducing an extra latent variable for each stem corresponding to its inflectional class. We also restrict ourselves to concatenative morphology, which builds words by concatenating stems, prefixes, and suffixes. Nonconcatenative morphologies20 (such as Tagalog’s reduplication, which copies syllables) are not handled. We assume that each morpheme is paired with a morphological category: either a prefix (pfx), suffix (sfx), or stem. We model the lexicon as a function from pairs of meanings and morphological categories to phonological forms. We model phonology as K ordered rules, written \(\{r_{k}\}_{k=1}^{K}\), each of which is a function mapping sequences of phonemes to sequences of phonemes. Given these definitions, we express the theory-induction objective as:

$$\begin{aligned}
&\arg\max_{\mathbf{T},\mathbf{L}}\; P(\mathbf{T},\mathbf{L}|\mathrm{UG}) \prod_{\langle f,m\rangle \in \mathbf{X}} \mathbb{1}\left[f=\mathrm{Phonology}(\mathrm{Morphology}(m))\right]\\
&\text{where}\quad \mathrm{Morphology}([\mathbf{stem}{:}\,\sigma;\, i]) = \mathbf{L}(i,\mathtt{pfx})\cdot \mathbf{L}(\sigma,\mathtt{stem})\cdot \mathbf{L}(i,\mathtt{sfx}) \qquad \text{(concatenate prefix, stem, suffix)}\\
&\phantom{\text{where}\quad} \mathrm{Phonology}(m) = r_{1}(r_{2}(\cdots r_{K}(m)\cdots )) \qquad \text{(apply ordered rewrite rules)}
\end{aligned}\tag{1}$$

where [stem: σ; i] is a meaning with stem σ, and i denotes the remaining aspects of meaning that exclude the stem (e.g., i could be [tense:PAST; gender:FEMALE]). The expression \(\mathbb{1}\left[\cdot \right]\) equals 1 if its argument is true and 0 otherwise. In words, Eq. (1) seeks the highest probability theory that exactly reproduces the data, like classic minimum description length (MDL) learners21. This equation forces the model to explain every word in terms of rules operating over concatenations of morphemes, and does not allow wholesale memorization of words in the lexicon. Eq. (1) assumes fusional morphology: every distinct combination of inflections fuses into a new prefix/suffix. This fusional assumption can emulate arbitrary concatenative morphology: although each inflection seems to have a single prefix/suffix, the lexicon can implicitly cache concatenations of morphemes. For instance, if the morpheme marking tense precedes the morpheme marking gender, then L([tense:PAST; gender:FEMALE], pfx) could equal L([tense:PAST], pfx) · L([gender:FEMALE], pfx). We use a description-length prior for \(P(\mathbf{T},\mathbf{L}|\mathrm{UG})\) favoring compact lexica and fewer, less complex rules (Supplementary Methods 3.4).
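The indicator term of Eq. (1) can be sketched in a few lines. The code below is a minimal illustration, not the authors' implementation: it checks only whether a candidate theory and lexicon reproduce the data exactly, and it omits the prior. The names `morphology`, `phonology`, `explains`, and `insert_a` are hypothetical, the vowel inventory is a toy assumption, and the rule list is assumed to be given in application order. The Serbo-Croatian forms come from Fig. 1b.

```python
from functools import reduce
from typing import Callable, Dict, List, Tuple

Rule = Callable[[str], str]
Lexicon = Dict[Tuple[str, str], str]      # (meaning, category) -> phonological form
Datum = Tuple[str, Tuple[str, str]]       # (observed form, (stem meaning, inflection))

VOWELS = set("aeiou")                     # assumption: toy vowel inventory


def insert_a(form: str) -> str:
    """Epenthesis rule (insert /a/ in the context C_C#): break up a final cluster."""
    if len(form) >= 2 and form[-1] not in VOWELS and form[-2] not in VOWELS:
        return form[:-1] + "a" + form[-1]
    return form


def morphology(lexicon: Lexicon, stem: str, inflection: str) -> str:
    """Concatenate prefix, stem, and suffix looked up in the lexicon."""
    return (lexicon[(inflection, "pfx")]
            + lexicon[(stem, "stem")]
            + lexicon[(inflection, "sfx")])


def phonology(rules: List[Rule], form: str) -> str:
    """Apply the rules one after another (list assumed to be in application order)."""
    return reduce(lambda current, rule: rule(current), rules, form)


def explains(lexicon: Lexicon, rules: List[Rule], data: List[Datum]) -> bool:
    """Product of indicators in Eq. (1): True iff every observed form is reproduced."""
    return all(form == phonology(rules, morphology(lexicon, stem, infl))
               for form, (stem, infl) in data)


lexicon = {
    ("CLEAR", "stem"): "yasn",
    ("MASC", "pfx"): "", ("MASC", "sfx"): "",
    ("FEM", "pfx"): "", ("FEM", "sfx"): "a",
}
data = [("yasan", ("CLEAR", "MASC")), ("yasna", ("CLEAR", "FEM"))]

print(explains(lexicon, [insert_a], data))   # theory with the epenthesis rule
print(explains(lexicon, [], data))           # theory with no phonological rules
```

With the stem /yasn/ the epenthesis rule is needed to derive /yasan/, so the rule-free theory fails the indicator and would receive zero probability under Eq. (1).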

Fig. 2: The generative model underlying our approach.

We infer grammars (teal) for a range of languages, given only form/meaning pairs (orange) and a space of programs (purple). Form/meaning pairs are typically arranged in a stem × inflection matrix. For example, the lower right matrix entry for Catalan means we observe the form/meaning pair /grizə/,[stem:GREY; gender:FEM]. Grammars include phonology, which transforms concatenations of stems and affixes into the observed surface forms using a sequence of ordered rules, labeled r1, r2, etc. The grammar's lexicon contains stems, prefixes, and suffixes, and morphology concatenates different suffixes/prefixes to each stem for each inflection. ϵ refers to the empty string. Each rule is written as a context-dependent rewrite, and beneath it, an English description. In the lower black boxes, we show the inferred derivation of the observed data, i.e. the execution trace of the synthesized program. Grammars are expressed as programs drawn from a universal grammar, or space of allowed programs. Makonde and Catalan are illustrated here. Other examples are in Fig. 4 and Supplementary Figs. 13.

The data X typically come from a paradigm matrix, whose columns range over inflections and whose rows range over stems (Supplementary Methods 3.1). In this setting, an equivalent Bayesian framing (“Methods”) permits probabilistic scoring of new stems by treating the rules and affixes as a generative model over paradigm rows.

Representing rules and sounds

Phonemes (atomic sounds) are represented as vectors of binary features. For example, one such feature is nasal, for which /m/ and /n/, among others, are +nasal. Phonological rules operate over this feature space. To represent the space of such rules we adopt the classical formulation in terms of context-dependent rewrites22. These are sometimes called SPE-style rules since they were used extensively in the Sound Pattern of English22. Rules are written (focus) → (structural change) / (left trigger) _ (right trigger), meaning that the focus phoneme(s) are transformed according to the structural change whenever the left/right triggering environments occur immediately to the left/right of the focus (Supplementary Fig. 5). Triggering environments specify conjunctions of features (characterizing sets of phonemes sometimes called natural classes). For example, in English, phonemes which are [−sonorant] (such as /d/) become [−voice] (e.g., /d/ becomes /t/) at the end of a word (written #) whenever the phoneme to the left is an unvoiced nonsonorant ([−voice −sonorant], such as /k/), written [−sonorant] → [−voice] / [−voice −sonorant] _ #. This specific rule transforms the past tense walked from /wɔkd/ into its pronounced form /wɔkt/. The subscript 0 denotes zero or more repetitions of a feature matrix, called the “Kleene star” operator (i.e., \([+\mathrm{voice}]_{0}\) means zero or more repetitions of [+voice] phonemes). When such rules are restricted so that they cannot apply cyclically to their own output, the rules and morphology correspond to 2-way rational functions, which in turn correspond to finite-state transducers23. It has been argued that the space of finite-state transductions has sufficient representational power to cover known empirical phenomena in morpho-phonology and represents a limit on the descriptive power actually used by phonological theories, even those that are formally more powerful, including Optimality Theory24.
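To illustrate how a rule operates over features rather than individual phonemes, the sketch below encodes the word-final devoicing rule from the text. The four-feature table is a deliberately tiny, hypothetical inventory chosen so that every phoneme has a distinct feature vector; it is not the feature system used in the paper, and the helper names are illustrative.

```python
from typing import Dict

# Hypothetical toy feature table: just enough features to tell these six phonemes apart.
FEATURES: Dict[str, Dict[str, bool]] = {
    "w": {"sonorant": True,  "voice": True,  "coronal": False, "syllabic": False},
    "ɔ": {"sonorant": True,  "voice": True,  "coronal": False, "syllabic": True},
    "k": {"sonorant": False, "voice": False, "coronal": False, "syllabic": False},
    "g": {"sonorant": False, "voice": True,  "coronal": False, "syllabic": False},
    "t": {"sonorant": False, "voice": False, "coronal": True,  "syllabic": False},
    "d": {"sonorant": False, "voice": True,  "coronal": True,  "syllabic": False},
}

FOCUS = {"sonorant": False}                          # [-sonorant]
CHANGE = {"voice": False}                            # [-voice]
LEFT_TRIGGER = {"voice": False, "sonorant": False}   # [-voice -sonorant]


def matches(phoneme: str, spec: Dict[str, bool]) -> bool:
    """True if the phoneme carries every feature value in the specification."""
    return all(FEATURES[phoneme][name] == value for name, value in spec.items())


def apply_change(phoneme: str, change: Dict[str, bool]) -> str:
    """Overwrite the changed features and look up the phoneme with the result."""
    target = {**FEATURES[phoneme], **change}
    for candidate, feats in FEATURES.items():
        if feats == target:
            return candidate
    raise ValueError(f"no phoneme in the toy table matches {target}")


def devoice_final(word: str) -> str:
    """[-sonorant] becomes [-voice] after [-voice -sonorant], at the end of the word."""
    if len(word) >= 2 and matches(word[-1], FOCUS) and matches(word[-2], LEFT_TRIGGER):
        return word[:-1] + apply_change(word[-1], CHANGE)
    return word


print(devoice_final("wɔkd"))   # walked: final /d/ follows voiceless /k/
print(devoice_final("dɔg"))    # final /g/ follows a sonorant, so the rule does not fire
```

Because the rule mentions only features, the same three specifications would cover any other obstruent pair added to the table without changing the rule itself.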

To learn such grammars, we adopt the approach of Bayesian Program Learning (BPL). In this setting, we model each T as a program in a programming language that captures domain-specific constraints on the problem space. The linguistic architecture common to all languages is often referred to as universal grammar. Our approach can be seen as a modern instantiation of a long-standing approach in linguistics that adopts human-understandable generative representations to formalize universal grammar22.


We have defined the problem a BPL theory inductor needs to solve, but have not given any guidance on how to solve it. In particular, the space of all programs is infinitely large and lacks the local smoothness exploited by local optimization algorithms like gradient descent or Markov Chain Monte Carlo. We adopt a strategy based on constraint-based program synthesis, where the optimization problem is translated into a combinatorial constraint satisfaction problem and solved using a Boolean Satisfiability (SAT) solver25. These solvers implement an exhaustive but relatively efficient search and guarantee that, given enough time, an optimal solution will be found. We use the Sketch26 program synthesizer, which can solve for the smallest grammar consistent with some data, subject to an upper bound on the grammar size (see “Methods”).

In practice, the clever exhaustive search techniques employed by SAT solvers fail to scale to the many rules needed to explain large corpora. To scale these solvers to large and complex theories, we take inspiration from a basic feature of how children acquire language and how scientists build theories. Children do not learn a language in one fell swoop, instead progressing through intermediate stages of linguistic development, gradually enriching their mastery of both grammar and lexicon. Similarly, a sophisticated scientific theory might start with a simple conceptual kernel, and then gradually grow to encompass more and more phenomena. Motivated by these observations, we engineered a program synthesis algorithm that starts with a small program, and then repeatedly uses a SAT solver to search for small modifications that allow it to explain more and more data. Concretely, we find a counterexample to our current theory, and then use the solver to exhaustively explore the space of all small modifications to the theory which can accommodate this counterexample. This combines ideas from counter-example guided inductive synthesis26 (which alternates synthesis with a verifier that feeds new counterexamples to the synthesizer) with test-driven synthesis27 (which synthesizes new conditional branches for each such counterexample); it also exposes opportunities for parallelism (Supplementary Methods 3.3). Figure 3 illustrates this incremental, solver-aided synthesis algorithm, while Supplementary Methods 3.3 gives a concrete walk-through of the first few iterations.
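The counterexample-driven loop described above can be sketched as follows. This is a simplified illustration under strong assumptions: underlying forms are given rather than inferred, the rule pool contains only two hand-written rules, and the SAT solver is replaced by brute-force enumeration of theories within a small edit radius. All names (`POOL`, `neighbors`, `synthesize`) are hypothetical.

```python
from typing import Callable, Dict, List, Optional, Set, Tuple

SCHWA = "\u0259"
STRIDENTS, VOICELESS = set("sz"), set("ptks")   # assumption: toy phoneme classes


def epenthesis(word: str) -> str:
    if len(word) >= 2 and word[-1] == "z" and word[-2] in STRIDENTS:
        return word[:-1] + SCHWA + "z"
    return word


def devoicing(word: str) -> str:
    if len(word) >= 2 and word[-1] == "z" and word[-2] in VOICELESS:
        return word[:-1] + "s"
    return word


POOL: Dict[str, Callable[[str], str]] = {"epenthesis": epenthesis, "devoicing": devoicing}
Theory = Tuple[str, ...]          # ordered rule names, applied left to right
Pair = Tuple[str, str]            # (underlying form, observed surface form)


def run(theory: Theory, underlying: str) -> str:
    for name in theory:
        underlying = POOL[name](underlying)
    return underlying


def neighbors(theory: Theory, radius: int) -> Set[Theory]:
    """All theories within `radius` single-rule insertions or deletions."""
    seen, frontier = {theory}, {theory}
    for _ in range(radius):
        step: Set[Theory] = set()
        for t in frontier:
            for i in range(len(t) + 1):
                for name in POOL:
                    step.add(t[:i] + (name,) + t[i:])
            for i in range(len(t)):
                step.add(t[:i] + t[i + 1:])
        frontier = step - seen
        seen |= step
    return seen


def synthesize(data: List[Pair], radius: int = 1) -> Optional[Theory]:
    theory: Theory = ()
    covered: List[Pair] = []
    while True:
        # Verifier: look for a datum the current theory fails to explain.
        cex = next(((u, s) for u, s in data if run(theory, u) != s), None)
        if cex is None:
            return theory
        covered.append(cex)
        # "Solver": exhaustively search nearby theories that explain all covered data.
        fits = [t for t in neighbors(theory, radius)
                if all(run(t, u) == s for u, s in covered)]
        if not fits:
            return None           # no small modification works; the heuristic gives up
        theory = min(fits, key=lambda t: (len(t), t))


data = [("dagz", "dagz"), ("kætz", "kæts"), ("hɔrsz", "hɔrs" + SCHWA + "z")]
print(synthesize(data))
```

On this toy data the loop first adds devoicing to handle /kæts/, then inserts epenthesis before it to handle the strident-final stem, so each step explores only a handful of neighboring theories instead of the full space.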

Fig. 3: Inference method for Bayesian Program Learning.

To scale to large programs explaining large corpora, we repeatedly search for small modifications to our current theory. Such modifications are driven by counterexamples to the current theory. Blue: grammars. Red: search radius.

This heuristic approach lacks the completeness guarantee of SAT solving: it does not provably find an optimal solution, despite repeatedly invoking a complete, exact SAT solver. However, each such repeated invocation is much more tractable than direct optimization over the entirety of the data. This is because constraining each new theory to be close in theory-space to its preceding theory leads to polynomially smaller constraint satisfaction problems and therefore exponentially faster search times, because SAT solvers scale, in the worst case, exponentially with problem size.

Quantitative analysis

We apply our model to 70 problems from linguistics textbooks28,29,30. Each textbook problem requires synthesizing a theory of a number of forms drawn from some natural language. These problems span a wide range of difficulties and cover a diverse set of natural language phenomena. This includes tonal languages, for example, in Kerewe, to count is /kubala/, but to count it is /kukíbála/, where accents mark high tones; languages with vowel harmony, for example Turkish has /el/, /tan/ meaning hand, bell, respectively, and /el-ler/, /tan-lar/ for the plurals hands, bells, respectively (dashes inserted at suffix boundaries for clarity); and many other linguistic phenomena such as assimilation and epenthesis (Fig. 4 and Supplementary Figs. 13).

Fig. 4: Qualitative results on morpho-phonological grammar discovery illustrated on phonology textbook problems.

The model observes form/meaning pairs (orange) and jointly infers both a language-specific theory (teal; phonological rules labeled r1, r2, ...) and a data set-specific lexicon (teal) containing stems and affixes. Together the theory and lexicon explain the orange data via a derivation where the morphology output (prefix+stem+suffix) is transformed according to the ordered rules. Notice interacting nonlocal rules in Kerewe, a language with tones. Notice multiple vowel harmony rules in Sakha. Supplementary Figs. 13 provide analogous illustrations of grammars with epenthesis (Yowlumne), stress (Serbo-Croatian), vowel harmony (Turkish, Hungarian, Yowlumne), assimilation (Lumasaaba), and representative partial failure cases on Yowlumne and Somali (where it recovers a partly correct rule set that fails to explain 20% of the data, while also illustrating spirantization).

We first measure the model’s ability to discover the correct lexicon. Compared to ground-truth lexica, our model finds grammars correctly matching the entirety of the problem’s lexicon for 60% of the benchmarks, and correctly explains the majority of the lexicon for 79% of the problems (Fig. 5a). Typically, the correct lexicon for each problem is less ambiguous than the correct rules, and any rules which generate the full data from the correct lexicon must be observationally equivalent to any ground truth rules we might posit. Thus, agreement with ground-truth lexica should act as a proxy for whether the synthesized rules have the correct behavior on the data, which should correlate with rule quality. To test this hypothesis we randomly sample 15 problems and grade the discovered rules, in consultation with a professional linguist (the second author). We measure both recall (the fraction of actual phonological rules correctly recovered) and precision (the fraction of recovered rules which actually occur). Rule accuracy, under both precision and recall, positively correlates with lexicon accuracy (Fig. 5c): when the system gets all the lexicon correct, it rarely introduces extraneous rules (high precision), and virtually always gets all the correct rules (high recall).
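The two rule-level metrics can be written down directly. In the paper the grading was done manually in consultation with a linguist; the sketch below instead assumes, purely for illustration, that each rule can be identified by an exact string label, and the example rule names are hypothetical.

```python
from typing import Set, Tuple


def precision_recall(recovered: Set[str], gold: Set[str]) -> Tuple[float, float]:
    """Precision: share of recovered rules that are real. Recall: share of real rules recovered."""
    correct = recovered & gold
    precision = len(correct) / len(recovered) if recovered else 1.0
    recall = len(correct) / len(gold) if gold else 1.0
    return precision, recall


# Hypothetical graded problem: two true rules found, plus one extraneous rule.
recovered = {"schwa epenthesis", "final devoicing", "spurious vowel deletion"}
gold = {"schwa epenthesis", "final devoicing"}

precision, recall = precision_recall(recovered, gold)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

An extraneous rule lowers precision without affecting recall, which is why the two metrics are reported separately.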

Fig. 5: Models applied to data from phonology textbooks.

a Measuring % lexicon solved, which is the percentage of stems that match gold ground-truth annotations. Problems marked with an asterisk are allophony problems and are typically easier. For allophony problems, we count % solved as 0% when no rule explaining an alternation is found and 100% otherwise. For allophony problems, full/CEGIS models are equivalent, because we batch the full problem at once (Supplementary Methods 3). b Convergence rate of models evaluated on the 54 non-allophony problems. All models are run with a 24-h timeout on 40 cores. Only our full model can best tap this parallelism (Supplementary Methods 3.3). Our models typically converge within a half-day. SyPhon36 solves fewer problems but, of those it does solve, it takes minutes rather than hours. Curves show means over problems. Error bars show the standard error of the mean. c Rule accuracy was assessed by manually grading 15 random problems. Both precision and recall correlate with lexicon accuracy, and all three metrics are higher for easier problems requiring fewer phonological rules (red, easier; blue, harder). Requiring an exact match with a ground-truth stem occasionally allows solving some rules despite not matching any stems, as in the outlier problem marked with **. Pearson’s r confidence intervals (CI) were calculated with a two-tailed test. Points were randomly jittered ±0.05 for visibility. Source data are provided as a Source data file.

Prior approaches to morphophonological process learning either abandon theory induction by learning black-box probabilistic models31, or induce interpretable models but do not scale to a wide range of challenging and realistic data sets. These interpretable alternatives include unsupervised distributional learners, such as the MDL genetic algorithm in Rasin et al.32, which learns from raw word frequencies. Other interpretable models leverage strong supervision: Albright et al.33 learns rules from input–outputs, while ref. 34 learns finite state transducers in the same setting. Other works attain strong theoretical learning guarantees by restricting the class of rules: e.g., ref. 35 considers 2-input strictly local functions. These interpretable approaches typically consider 2–3 simple rules at most. In contrast, Goldwater et al.34 scales to tens of rules on thousands of words by restricting itself to non-interacting local orthographic rules.

Our results hinge on several factors. A key ingredient is a correct set of constraints on the space of hypotheses, i.e. a universal grammar. We can systematically vary this factor: switching from phonological articulatory features to simpler acoustic features degrades performance (simple features in Fig. 5a, b). Our simpler acoustic features come from the first half of a standard phonology text28, while the articulatory features come from the latter half, so this comparison loosely models a contrast between novice and expert phonology students (Supplementary Methods 3.5). We can further remove two essential sources of representational power–Kleene star, which allows arbitrarily long-range dependencies, and phonological features, which allow analogizing and generalizing across phonemes. Removing these renders only the simplest problems solvable (-representation in Fig. 5a, b). Basic algorithmic details also matter. Building a large theory at once is harder for human learners, and also for our model (CEGIS in Fig. 5a, b). The recent SyPhon36 algorithm strikes a different and important point on the accuracy/coverage tradeoff: it aims to solve problems in seconds or minutes so that linguists can interactively use it. In contrast, our system’s average solution time is 3.6 h (Fig. 5b). SyPhon’s speed comes from strong independence assumptions between lexica and individual rules, and from disallowing non-local rules. These assumptions degrade coverage: SyPhon fails to solve 76% of our data set. We hope that their work and ours sets the stage for future systems that run interactively while also more fully modeling the richness and diversity of human language.

Child language generalization

If our model captures aspects of linguistic analysis from naturalistic data, and assuming linguists and children confront similar problems, then our approach should extend to model at least some aspects of the child’s linguistic generalization. Studying children’s (and adults’) learning of carefully constructed artificial grammars has a long tradition in psycholinguistics and language acquisition37,38,39, because it permits controlled and careful study of the generalization of language-like patterns. We present our model with the artificial stimuli used in a number of artificial grammar learning (AGL) experiments38,39,40 (Fig. 6a), systematically varying the quantity of data given to the model (Fig. 6b). The model demonstrates few-shot inference of the same language patterns probed in classic infant studies of AGL.

Fig. 6: Modeling artificial grammar learning.

a Children can few-shot learn many qualitatively different grammars, as studied in controlled conditions in AGL experiments. Our model learns these as well. Grammar names ABB/ABA/AAx/AxA refer to syllable structure: A/B are variable syllables, and x is a constant syllable. For example, ABB words have three syllables, with the last two syllables being identical. NB: Actual reduplication is subtler than syllable-copying20. b The model learns to discriminate between different artificial grammars by training on examples of one grammar (e.g., AAB) and then testing on either unseen examples of words drawn from the same grammar (consistent condition, e.g., new words following the AAB pattern) or unseen examples of words from a different grammar (inconsistent condition, e.g., new words following the ABA pattern), following the paradigm of ref. 39. We plot log-odds ratios of consistent and inconsistent conditions: \(\log \left[P(\mathrm{consistent}|\mathrm{train})/P(\mathrm{inconsistent}|\mathrm{train})\right]\) (“Methods”), over n = 15 random independent (in)consistent word pairs. Bars show the mean log-odds ratio over these 15 samples, individually shown as black points, with error bars showing the standard deviation. We contrast models using program spaces both with and without syllabic representations, which were not used for textbook problems. Syllabic representation proves important for few-shot learning, but a model without syllables can still discriminate successfully given enough examples by learning rules that copy individual phonemes. See Supplementary Fig. 4 for more examples. Source data are provided as a Source data file.

These AGL stimuli contain very little data, and thus these few-shot learning problems admit a broad range of possible generalizations. From this space, children pick out the linguistically plausible generalizations. Thus, rather than producing a single grammar, we use the model to search a massive space of possible grammars and then visualize all those grammars that are Pareto-optimal solutions41 to the trade-off between parsimony and fit to data. Here parsimony means size of rules and affixes (the prior in Eq. (10)); fit to data means average stem size (the likelihood in Eq. (10)); and a Pareto-optimal solution is one that no other solution improves on along both of these competing axes. Figure 7 visualizes Pareto fronts for two classic artificial grammars while varying the number of example words provided to the learner, illustrating both the set of grammars entertained by the learner and how the learner weighs these grammars against each other. These figures show the exact contours of the Pareto frontier: these problems are small enough that exact SAT solving is tractable over the entire search space, so our heuristic incremental synthesizer is unneeded. With more examples the shape of the Pareto frontier develops a sharp kink around the correct generalization; with fewer examples, the frontier is smoother and more diffuse. By explaining both natural language data and AGL studies, we see our model as delivering on a basic hypothesis underpinning AGL research: that artificial grammar learning must engage some cognitive resource shared with first language acquisition. To the extent that this hypothesis holds, we should expect an overlap between models capable of learning real linguistic phenomena, like ours, and models of AGL phenomena.
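The notion of Pareto optimality used here can be sketched in code. The example below filters a list of candidate grammars down to those not dominated on the two axes named in the text, where smaller is better on both. The candidate grammars and their sizes are invented for illustration and are not results from the paper; the paper obtains the exact frontier with a constraint solver rather than by filtering an enumerated list.

```python
from typing import List, NamedTuple


class Grammar(NamedTuple):
    name: str
    rule_affix_size: int   # parsimony axis: size of rules and affixes (smaller is better)
    stem_size: float       # fit axis: average stem size (smaller is better)


def dominates(a: Grammar, b: Grammar) -> bool:
    """a dominates b if it is no worse on both axes and strictly better on one."""
    no_worse = a.rule_affix_size <= b.rule_affix_size and a.stem_size <= b.stem_size
    better = a.rule_affix_size < b.rule_affix_size or a.stem_size < b.stem_size
    return no_worse and better


def pareto_front(grammars: List[Grammar]) -> List[Grammar]:
    """Keep the grammars that no other candidate dominates."""
    return [g for g in grammars if not any(dominates(o, g) for o in grammars)]


# Hypothetical candidates for an ABB-style problem (all numbers are made up).
candidates = [
    Grammar("memorize whole words", 0, 9.0),
    Grammar("copy final syllable", 4, 6.0),
    Grammar("copy plus redundant rule", 7, 6.0),
    Grammar("irrelevant insertion rule", 5, 8.0),
]

for g in pareto_front(candidates):
    print(g.name, g.rule_affix_size, g.stem_size)
```

In this made-up example the frontier keeps the pure memorization grammar and the syllable-copying grammar, mirroring the trade-off between having no rules with long stems and paying for a rule that shortens every stem.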

Fig. 7: Modeling ambiguity in language learning.

Few-shot learning of language patterns can be highly ambiguous as to the correct grammar. Here we visualize the geometry of generalization for several natural and artificial grammar learning problems. These visualizations are Pareto frontiers: the set of solutions consistent with the data that optimally trade off parsimony against fit to data. We show Pareto fronts for ABB (ref. 39; top two) and AAX (Gerken53; bottom right, data drawn from isomorphic phenomena in Mandarin) AGL problems for either one example word (upper left) or three example words (right column). In the bottom left we show the Pareto frontier for a textbook Polish morpho-phonology problem. Rightward on the x-axis corresponds to more parsimonious grammars (smaller rule size + affix size) and upward on the y-axis corresponds to grammars that best fit the data (smaller stem size), so the best grammars live in the upper right corners of these graphs. N.B.: Because the grammars and lexica vary in size across panels, the x and y axes have different scales in each panel. Pink shade: correct grammar. As the number of examples increases, the Pareto fronts develop a sharp kink around the correct grammar, which indicates a stronger preference for the correct grammar. With one example the kinks can still exist but are less pronounced. The blue lines provably show the exact contour of the Pareto frontier, up to the bound on the number of rules. This precision is owed to our use of exact constraint solvers. We show the Polish problem because the textbook author accidentally chose data with an unintended extra pattern: all stem vowels are /o/ or /u/, which the upper left solution encodes via an insertion rule. Although the Polish MAP solution is correct, the Pareto frontier can reveal other possible analyses such as this one, thereby serving as a kind of linguistic debugging. Source data are provided as a Source data file.

Synthesizing higher-level theoretical knowledge

No theory is built from scratch; instead, researchers borrow concepts from existing frameworks, make analogies with other successful theories, and adapt general principles to specific cases. Through analysis and modeling of many different languages, phonologists (and linguists more generally) develop overarching meta-models that restrict and bias the space of allowed grammars. They also develop the phonological common sense that allows them to infer grammars from sparse data, knowing which rule systems are plausible based on their prior knowledge of human language, and which systems are implausible or simply unattested. For example, many languages devoice word-final obstruents, but almost no language voices word-final obstruents (cf. Lezgian42). This cross-theory common sense is found in other sciences. For example, physicists know which potential energy functions tend to occur in practice (radially symmetric, pairwise, etc.). Thus a key objective for our work is the automatic discovery of a cross-language metamodel capable of imparting phonological common sense.

Conceptually, this meta-theorizing corresponds to estimating a prior, M, over language-specific theories, and performing hierarchical Bayesian inference across many languages. Concretely, we think of the meta-theory M as being a set of schematic, highly reusable phonological-rule templates, encoded as a probabilistic grammar over the structure of phonological rules, and we will estimate both the structure and the parameters of this grammar jointly with the solutions to textbook phonology problems. To formalize a set of meta-theories and define a prior over that set, we use the Fragment Grammars formalism43, a probabilistic grammar learning setup that caches and reuses fragments of commonly used rule subparts. Assuming we have a collection of D data sets (e.g., from different languages), notated \(\{{{{{\bf{X}}}}}^{d}\}_{d=1}^{D}\), our model constructs D grammars, \({\{\langle {{{{{{\bf{T}}}}}}}^{d},{{{{{{\bf{L}}}}}}}^{d}\rangle \}}_{d=1}^{D}\), along with a meta-theory M, seeking to maximize

$$P({{{{{\bf{M}}}}}})\mathop{\prod }\limits_{d=1}^{D}P({{{{{{\bf{T}}}}}}}^{d},{{{{{{\bf{L}}}}}}}^{d}|{{{{{\bf{M}}}}}})P({{{{{{\bf{X}}}}}}}^{d}|{{{{{{\bf{T}}}}}}}^{d},{{{{{{\bf{L}}}}}}}^{d})$$

where P(M) is a prior on fragment grammars over SPE-style rules. In practice, jointly optimizing over the space of Ms and grammars is intractable, and so we instead alternate between finding high-probability grammars under our current M, and then shifting our inductive bias, M, to more closely match the current grammars. We estimate M by applying this procedure to a training subset comprising 30 problems, chosen to exemplify a range of distinct phenomena, and then apply this M to all 70 problems. Critically, this unsupervised procedure is not given access to any ground-truth solutions to the training subset.

This machine-discovered higher-level knowledge serves two functions. First, it is a form of human understandable knowledge: manually inspecting the contents of the fragment grammar reveals cross-language motifs previously discovered by linguists (Fig. 8c). Second, it can be critical to actually getting these problems correct (Fig. 8a, b and middle column of Fig. 8c). This occurs because a better inductive bias steers the incremental synthesizer toward more promising avenues, which decreases its chances of getting stuck in a neighborhood of the search space where no incremental modification offers improvement.

Fig. 8: Discovering and using a cross-language metatheory.

a Re-solving the hardest textbook problems using the learned fragment grammar metatheory leads to an average of 31% more of the problem being solved. b illustrates a case where these discovered tendencies allow the model to find a set of six interacting rules solving the entirety of an unusually complex problem. c The metatheory comprises rule schemas that are human understandable and often correspond to motifs previously identified within linguistics. Left column shows four out of 21 induced rule schemas (Supplementary Fig. 6), which encode cross-language tendencies. These learned schemas include vowel harmony and spirantization (a process where stops become fricatives near vowels). The symbol FM means a slot that can hold any feature matrix, and trigger means a slot that can hold any rule triggering context. Middle column shows model output when solving each language in isolation: these solutions can be overly specific (Koasati, Bukusu), overly general (Kerewe, Turkish), or even essentially unrelated to the correct generalization (Tibetan). Right column shows model output when solving problems jointly with inferring a metatheory. Source data are provided as a Source Data file.

To be clear, our mechanized meta-theorizing is not an attempt to learn universal grammar (cf. ref. 44). Rather than capture a learning process, our meta-theorizing is analogous to a discovery process that distills knowledge of typological tendencies, thereby aiding future model synthesis. However, we believe that children possess implicit knowledge of these and other tendencies, which contributes to their skills as language learners. Similarly, we believe the linguist’s skill in analysis draws on an explicit understanding of these and other cross-linguistic trends.


Our high-level goal was to engineer methods for synthesizing interpretable theories, using morphophonology as a testbed and linguistic analysis as inspiration. Our results give a working demonstration that it is possible to automatically discover human-understandable knowledge about the structure of natural language. As with linguists, our model's optimal inference hinges on higher-level biases and constraints, but the toolkit developed here permits systematic probing of these abstract assumptions and data-driven discovery of cross-language trends. Our work speaks to a long-standing analogy between the problems confronting children and linguists, and computationally cashes out the basic assumptions that underpin infant and child studies of artificial grammar learning.

Within phonology, our work offers a computational tool that can be used to study different grammatical hypotheses: mapping and scoring analyses under different objective functions, and studying the implications of different inductive biases and representations across a suite of languages. This toolkit can spur quantitative studies of classic phonological problems, such as probing extensionally-equivalent analyses (e.g., distinguishing deletion from epenthesis).

More broadly, the tools and approaches developed here suggest routes for machines that learn the causal structure of the world, while representing their knowledge in a format that can be reused and communicated to other agents, both natural and artificial. While this goal remains far off, it is worth taking stock of where this work leaves us on the path toward a theory induction machine: what are the prospects for scaling an approach like ours to other domains of language, or other domains of science more broadly? Scaling to the full linguistic hierarchy (acoustics, phonotactics, syntax, semantics, pragmatics) requires more powerful programming languages for expressing symbolic rules, and more scalable inference procedures: although the textbook problems we solve are harder than those tackled in prior work, full morpho-phonology remains larger and more intricate than the problems considered here. More fundamentally, however, we advocate for hybrid neuro-symbolic models45,46,47 to capture crisp systematic productivity alongside more graded linguistic generalizations, such as those embodied by distributional models of language structure48.

Scaling to real scientific discovery demands fundamental innovations, but holds promise. Unlike language acquisition, genuinely new scientific theories are hard-won, developing over timescales that can span a decade or more. They involve the development of new formal substrates and new vocabularies of concepts, such as force in physics and allele in biology. We suggest three lines of attack. Drawing inspiration from conceptual role semantics49, future automated theory builders could introduce and define new theoretical objects in terms of their interrelations to other elements of the theory’s conceptual repertoire, only at the end grounding out in testable predictions. Drawing on the findings of our work here, the most promising domains are those which are solvable, in some version, by both child learners and adult scientists. This means first investigating sciences with counterparts in intuitive theories, such as classical mechanics (and intuitive physics), or cognitive science (and folk psychology). Building on the findings here and in ref. 11, a crucial element of theory induction will be the joint solving of many interrelated model building problems, followed by the synthesis of abstract over-hypotheses that encapsulate the core theoretical principles while simultaneously accelerating future induction through shared statistical strength.

Theory induction is a grand challenge for AI, and our work here captures only small slices of the theory-building process. Like our model, human theorists do craft models by examining experimental data, but they also propose new theories by unifying existing theoretical frameworks, performing thought experiments, and inventing new formalisms. Humans also deploy their theories more richly than our model: proposing new experiments to test theoretical predictions, engineering new tools based on the conclusions of a theory, and distilling higher-level knowledge that goes far beyond what our Fragment-Grammar approximation can represent. Continuing to push theory induction along these many dimensions remains a prime target for future research.


Program synthesis

We use the Sketch26 program synthesizer. Sketch can solve the following constrained optimization problem, which is equivalent to our goal of maximizing P(X | T, L)P(T, L | UG):

$$\begin{array}{ll}{{{{{\rm{maximize}}}}}}&F({{{{{\bf{X}}}}}},{{{{{\bf{T}}}}}})=\mathop{\sum }\limits_{k=1}^{K}\log P({r}_{k}|{{{{{\rm{UG}}}}}})+\mathop{\sum}\limits_{\langle f,c,m\rangle \in {{{{{\bf{L}}}}}}}\log P(f|{{{{{\rm{UG}}}}}})\hfill\\ {{{{{\rm{subject}}}}}}\,{{{{{\rm{to}}}}}}&\hskip -90pt C({{{{{\bf{X}}}}}},{{{{{\bf{T}}}}}})=\forall \,\langle f,\,[{{{{{\bf{stem}}}}}}\!:\sigma ;\,i]\rangle \in {{{{{\bf{X}}}}}}:\\ &f={r}_{1}\left(\cdots {r}_{K}({{{{{\bf{L}}}}}}(\langle i,{\mathtt{pfx}}\rangle )\cdot {{{{{\bf{L}}}}}}(\langle \sigma,{\mathtt{stem}}\rangle )\cdot {{{{{\bf{L}}}}}}(\langle i,{\mathtt{sfx}}\rangle ))\cdots \,\right)\end{array}$$

given observations X and bound on the number of rules K.

Sketch offers an exhaustive search strategy, but we use incremental solving in order to scale to large grammars. Mathematically this works as follows: we iteratively construct a sequence of theories T0, T1, ... alongside successively larger data sets X0, X1, ... converging to the full data set X, such that the tth theory Tt explains data set Xt, and successive theories are close to one another as measured by edit distance:

$${{{{{{\bf{X}}}}}}}_{t+1}={{{{{{\bf{X}}}}}}}_{t}\cup ({{{{{\rm{a}}}}}}\,{{{{{\rm{set}}}}}}\,{{{{{\bf{X}}}}}}^{\prime} \subseteq \,{{{{{\bf{X}}}}}}\,{{{{{\rm{where}}}}}}\,\neg \,C({{{{{{\bf{T}}}}}}}_{t},{{{{{\bf{X}}}}}}^{\prime} ))$$
$${D}_{t+1}=\mathop{\min }\limits_{D}D,\, {{{{{\rm{such}}}}}}\,{{{{{\rm{that}}}}}}: {{{{\exists}}}}\,{{{{{\bf{T}}}}}}\,{{{{{\rm{where}}}}}}\,C({{{{{\bf{T}}}}}},{{{{{{\bf{X}}}}}}}_{t+1})\,\,{{{{{\rm{and}}}}}}\,d({{{{{\bf{T}}}}}},{{{{{{\bf{T}}}}}}}_{t})\,\le \,D$$
$${{{{{{\bf{T}}}}}}}_{t+1}=\arg \mathop{\max }\limits_{{{{{{\bf{T}}}}}}}F({{{{{{\bf{X}}}}}}}_{t+1},{{{{{\bf{T}}}}}}),{{{{{\rm{such}}}}}}\,{{{{{\rm{that}}}}}}:{{{{{\bf{T}}}}}}\,{{{{{\rm{satisfies}}}}}}\,C({{{{{\bf{T}}}}}},{{{{{{\bf{X}}}}}}}_{t+1})\,{{{{{\rm{and}}}}}}\,d({{{{{\bf{T}}}}}},{{{{{{\bf{T}}}}}}}_{t})\, \le\,{D}_{t+1}$$

where d(·, ·) measures edit distance between theories, Dt+1 is the edit distance between the theories at iterations t + 1 and t, and the t = 0 base cases are \({{{{{{\bf{X}}}}}}}_{0}=\varnothing \) and T0, an empty theory containing no rules. We “minibatch” counterexamples to the current theory (\({{{{{\bf{X}}}}}}^{\prime} \) in Eq. (4)), grouped by lexeme and ordered by their occurrence in the data (e.g., if the theory fails to explain walk/walks/walked, and this is the next example in the data, then the surface forms of walk/walks/walked will be added to Xt+1). See Supplementary Methods 3.3.
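The incremental loop above can be sketched in a few lines under strong simplifying assumptions. In this toy version a theory is an unordered set of string-rewrite rules drawn from a small hypothetical pool (`RULE_POOL`), the edit distance is the number of rules added or removed, the objective simply prefers fewer rules, and brute-force enumeration stands in for the Sketch solver. None of these choices is taken from the actual implementation; the sketch only illustrates how counterexamples grow the data set and how the search radius is widened until a consistent nearby theory exists.

```python
from itertools import combinations

# Hypothetical rule pool: each rule rewrites one substring ('#' marks word end).
RULE_POOL = [("d#", "t#"), ("b#", "p#"), ("g#", "k#"), ("a", "e")]


def apply_theory(theory, underlying):
    """Apply every rule in the theory (in a fixed sorted order) to a form."""
    form = underlying
    for old, new in sorted(theory):
        form = form.replace(old, new)
    return form


def explains(theory, data):
    """Constraint C(T, X): every (underlying, surface) pair is derived."""
    return all(apply_theory(theory, u) == s for u, s in data)


def neighbors(theory, radius):
    """All theories over RULE_POOL within `radius` edits of `theory`."""
    for k in range(radius + 1):
        for flips in combinations(RULE_POOL, k):
            yield theory ^ frozenset(flips)


def incremental_solve(data):
    theory, seen = frozenset(), []  # T_0 is empty, X_0 is empty
    while True:
        counterexamples = [x for x in data if not explains(theory, [x])]
        if not counterexamples:
            return theory
        seen.append(counterexamples[0])  # X_{t+1} = X_t plus a counterexample
        for radius in range(len(RULE_POOL) + 1):  # smallest feasible D_{t+1}
            feasible = [t for t in neighbors(theory, radius) if explains(t, seen)]
            if feasible:
                theory = max(feasible, key=lambda t: -len(t))  # stand-in for F
                break
        else:
            raise ValueError("no theory over the pool explains the data")


data = [("bad#", "bat#"), ("lob#", "lop#"), ("dog#", "dok#"), ("tap#", "tap#")]
theory = incremental_solve(data)
```

On this toy data the loop adds one devoicing rule per counterexample, so each step changes the previous theory by a single edit rather than re-solving from scratch.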

We implement all models as Python 2.7 scripts that invoke Sketch 1.7.5, and also use Python 2.7 for all data analysis.

Allophony problems

Allophony problems comprise the observed form-meaning set X, as well as a substitution, which is a partial function mapping phonemes to phonemes (see Supplementary Methods 3.1). This mapping operates over phonemes called ‘allophones.’ The goal of the model is to recover the rule(s) that predict which element of each allophone pair is an underlying form, and which is merely an allophone. The underlying phonemes are allowed in the lexicon, while the other allophones are not allowed in the lexicon and surface only due to phonological rules. For example, an allophony substitution could be \(\left\{b\,\mapsto\, p,d\,\mapsto\, t,g\,\mapsto\, k\right\}\). We extend such substitutions to total functions on phoneme sequences by applying the substitution to phonemes in its domain, and not applying it otherwise. We call this total function s(). For instance, using the previous example substitution, s(abkpg) = apkpk. Solving an allophony problem means finding rules that either map the domain of s() to its range (T1 below), or vice versa (T2 below):

$$ {{{{{{\bf{L}}}}}}}_{1}(m)=s(f)\;{{{{{\rm{when}}}}}}\,\exists \langle f,m\rangle \in {{{{{\bf{X}}}}}}\\ {{{{{{\bf{L}}}}}}}_{2}(m)={s}^{-1}(f)\,{{{{{\rm{when}}}}}}\,\exists \langle f,m\rangle \in {{{{{\bf{X}}}}}}\\ {{{{{\rm{For}}}}}}\,i\in \left\{1,2\right\}:\\ \quad{{{{{{\bf{T}}}}}}}_{i}=\mathop{{{\rm{arg max}}} }\limits_{{{{{{\bf{T}}}}}}}\,P({{{{{\bf{T}}}}}}|{{{{{\rm{UG}}}}}})P({{{{{\bf{X}}}}}}|{{{{{\bf{T}}}}}},{{{{{{\bf{L}}}}}}}_{i})\left(=\mathop{{{\rm{arg max}}}}\limits_{{{{{{\bf{T}}}}}}}P({{{{{\bf{X}}}}}},{{{{{\bf{T}}}}}},{{{{{{\bf{L}}}}}}}_{i}|{{{{{\rm{UG}}}}}})\right)$$
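The extension of a substitution to a total function on phoneme sequences can be illustrated with a short sketch. It assumes, purely for simplicity, that each character is one phoneme; the names `SUBSTITUTION`, `s`, and `s_inverse` are illustrative and not taken from the implementation. Note that the extended reverse mapping is not a true inverse of s() on arbitrary strings, since s() is not injective.

```python
# Example substitution from the text: b -> p, d -> t, g -> k.
SUBSTITUTION = {"b": "p", "d": "t", "g": "k"}
# Reverse mapping (p -> b, t -> d, k -> g), extended in the same way.
REVERSED = {target: source for source, target in SUBSTITUTION.items()}


def s(phonemes, substitution=SUBSTITUTION):
    """Apply the substitution where defined; leave other phonemes unchanged."""
    return "".join(substitution.get(ph, ph) for ph in phonemes)


def s_inverse(phonemes):
    """Apply the reversed substitution to every phoneme in its domain."""
    return s(phonemes, REVERSED)


forward = s("abkpg")
backward = s_inverse("apkpk")
```

Here `forward` reproduces the example in the text, s(abkpg) = apkpk, while `backward` maps every voiceless stop back to its voiced counterpart.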

Probabilistic framing

Our few-shot artificial grammar learning simulations require probabilistically scoring held-out unobserved words corresponding to unobserved stems. We now present a refactoring of our Bayesian learning setup that permits these calculations. Given rules T and lexicon L, we define a likelihood PLik over a paradigm matrix X when the data X contain stems disjoint from those in L:

$${P}^{{{{{{\rm{Lik}}}}}}}({{{{{\bf{X}}}}}}|{{{{{\bf{T}}}}}},{{{{{\bf{L}}}}}})=\mathop{\sum}\limits_{{{{{{\bf{L}}}}}}^{\prime} }P({{{{{\bf{X}}}}}}|{{{{{\bf{T}}}}}},{{{{{\bf{L}}}}}}\cup {{{{{\bf{L}}}}}}^{\prime} )P({{{{{\bf{L}}}}}}^{\prime}|{{{{{\rm{UG}}}}}})$$

where \({{{{{\bf{L}}}}}}^{\prime} \) ranges over lexica which assign forms to the stems present in X, i.e. \({{{{{\bf{L}}}}}}^{\prime} \ni \langle f^{\prime},{\mathtt{stem}},\sigma \rangle \) iff \({{{{{\bf{X}}}}}}\ni \langle f,[{{{{{\bf{stem}}}}}}\!:\sigma ;\,i]\rangle \) for some surface form f and some underlying form \(f^{\prime} \). The term PLik can be lower bounded by taking the most likely underlying form for each stem:

$${P}^{{{{{{\rm{Lik}}}}}}}({{{{{\bf{X}}}}}}|{{{{{\bf{T}}}}}},{{{{{\bf{L}}}}}})\,\ge \,\mathop{\max }\limits_{{{{{{\bf{L}}}}}}^{\prime} }P({{{{{\bf{X}}}}}}|{{{{{\bf{T}}}}}},{{{{{\bf{L}}}}}}\cup {{{{{\bf{L}}}}}}^{\prime} )P({{{{{\bf{L}}}}}}^{\prime}|{{{{{\rm{UG}}}}}})$$

This lower bound will be tightest when each paradigm row admits very few possible stems. Typically only one stem per row is consistent with the rules and affixes, which justifies this bound.
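A small numerical sketch shows why the bound is tight when few stems are consistent with the data. The candidate labels and prior probabilities below are hypothetical; each candidate underlying stem either derives the observed surface form under the rules and affixes (likelihood 1) or does not (likelihood 0).

```python
# Hypothetical candidates for one unobserved stem:
# label -> (prior P(L' | UG), whether it derives the observed surface form).
candidates = {
    "stem_a": (2.0 ** -8, True),
    "stem_b": (2.0 ** -12, True),
    "stem_c": (2.0 ** -8, False),
}

# Exact likelihood: sum the prior mass of every consistent stem.
exact = sum(prior for prior, derives in candidates.values() if derives)

# Lower bound: keep only the single most probable consistent stem.
bound = max(prior for prior, derives in candidates.values() if derives)

tightness = bound / exact  # equals 1.0 when exactly one stem is consistent
```

With these numbers the bound recovers 16/17 of the exact value, because the second consistent stem is longer and therefore carries far less prior mass.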

The connection between the Bayesian likelihood PLik and the MAP objective (Eq. (1)) can be seen by partitioning the lexicon into affixes (in L) and stems (in \({{{{{\bf{L}}}}}}^{\prime} \)), which also decomposes the objective into a parsimony-favoring prior and a fit-to-data favoring likelihood term:

$$\mathop{\max }\limits_{{{{{{\bf{T}}}}}},{{{{{\bf{L}}}}}}}P({{{{{\bf{T}}}}}},{{{{{\bf{L}}}}}}|{{{{{\rm{UG}}}}}}){P}^{{{{{{\rm{Lik}}}}}}}({{{{{\bf{X}}}}}}|{{{{{\bf{T}}}}}},{{{{{\bf{L}}}}}})\,\ge \,\mathop{\max }\limits_{{{{{{\bf{T}}}}}},{{{{{\bf{L}}}}}},{{{{{\bf{L}}}}}}^{\prime} }\underbrace{P({{{{{\bf{T}}}}},{{{{{\bf{L}}}}}}|{{{{{\rm{UG}}}}}})}}_{\begin{array}{c}{{{{{\rm{prior}}}}}}\end{array}}\underbrace{P({{{{{\bf{L}}}}}^{\prime}|{{{{{\rm{UG}}}}}})P({{{{{\bf{X}}}}}}|{{{{{\bf{T}}}}}},{{{{{\bf{L}}}}}}\cup {{{{{\bf{L}}}}}}^{\prime} )}}_{\begin{array}{c}{{{{{\rm{likelihood}}}}}}\end{array}}$$
$$=\mathop{\max }\limits_{{{{{{\bf{T}}}}}},{{{{{\bf{L}}}}}},{{{{{\bf{L}}}}}}^{\prime} }\underbrace{P({{{{{\bf{T}}}}},{{{{{\bf{L}}}}}}\cup {{{{{\bf{L}}}}}}^{\prime}|{{{{{\rm{UG}}}}}})P({{{{{\bf{X}}}}}}|{{{{{\bf{T}}}}}},{{{{{\bf{L}}}}}}\cup {{{{{\bf{L}}}}}}^{\prime} )}}_{\begin{array}{c}={{{{{\rm{Eq}}}}}}.1\,{{{{{\rm{w}}}}}}/{{{{{\rm{lexicon}}}}}}\,{{{{{\rm{set}}}}}}\,{{{{{\rm{to}}}}}}\,{{{{{\bf{L}}}}}}\cup {{{{{\bf{L}}}}}}^{\prime} \end{array}}$$

Few-shot artificial grammar learning

We present our system with a training set Xtrain of words from a target language, such as the ABA language (e.g., /wofewo/, /mikami/, ...). We model this training set as a paradigm matrix with a single column (single inflection), with each word corresponding to a different stem (a different row in the matrix). Then we compute the likelihood assigned to a held-out word Xtest either consistent with the target grammar (e.g., following the ABA pattern) or inconsistent with the target grammar (e.g., following the ABB pattern, such as /wofefe/, /mikaka/, ...). The probability assigned to a held-out test word, conditioned on the training set, is approximated by marginalizing over the Pareto-optimal grammars for the train set, rather than marginalizing over all possible grammars:

$$P({{{{{{\bf{X}}}}}}}_{{{{{{\rm{test}}}}}}}|{{{{{{\bf{X}}}}}}}_{{{{{{\rm{train}}}}}}}) =\mathop{\sum}\limits_{{{{{{\bf{T}}}}}},{{{{{\bf{L}}}}}}}P({{{{{\bf{T}}}}}},{{{{{\bf{L}}}}}}|{{{{{{\bf{X}}}}}}}_{{{{{{\rm{train}}}}}}}){P}^{{{{{{\rm{Lik}}}}}}}({{{{{{\bf{X}}}}}}}_{{{{{{\rm{test}}}}}}}|{{{{{\bf{T}}}}}},{{{{{\bf{L}}}}}})\hfill\\ \approx \mathop{\sum}\limits_{\langle {{{{{\bf{T}}}}}},{{{{{\bf{L}}}}}}\rangle \in {{{{{\rm{ParetoFrontier}}}}}}({{{{{{\bf{X}}}}}}}_{{{{{{\rm{train}}}}}}})}\frac{P({{{{{{\bf{X}}}}}}}_{{{{{{\rm{train}}}}}}},{{{{{\bf{T}}}}}},{{{{{\bf{L}}}}}}){P}^{{{{{{\rm{Lik}}}}}}}({{{{{{\bf{X}}}}}}}_{{{{{{\rm{test}}}}}}}|{{{{{\bf{T}}}}}},{{{{{\bf{L}}}}}})}{{\sum }_{\langle {{{{{\bf{T}}}}}}^{\prime},{{{{{\bf{L}}}}}}^{\prime} \rangle \in {{{{{\rm{ParetoFrontier}}}}}}({{{{{{\bf{X}}}}}}}_{{{{{{\rm{train}}}}}}})}P({{{{{{\bf{X}}}}}}}_{{{{{{\rm{train}}}}}}},{{{{{\bf{T}}}}}}^{\prime},{{{{{\bf{L}}}}}}^{\prime} )}$$

which relies on the fact that Sketch has out-of-the-box support for finding Pareto-optimal solutions to multiobjective optimization problems26. We approximate the likelihood PLik(Xtest | T, L) using the lower bound in Eq. (9), equivalently finding the shortest stem which will generate the test word Xtest, given the affixes in L and the rules in T.
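The frontier-based approximation amounts to a weighted average of held-out likelihoods, with weights given by the normalized joint probabilities of the Pareto-optimal grammars. The sketch below works in log space for numerical stability; the function names and the two-grammar frontier are hypothetical values chosen so that the arithmetic is easy to check by hand.

```python
import math


def logsumexp(values):
    """Numerically stable log of a sum of exponentials."""
    m = max(values)
    if m == -math.inf:
        return -math.inf
    return m + math.log(sum(math.exp(v - m) for v in values))


def predictive_log_prob(frontier):
    """frontier: list of (log P(X_train, T, L), log P_Lik(X_test | T, L))."""
    log_z = logsumexp([joint for joint, _ in frontier])
    return logsumexp([joint - log_z + lik for joint, lik in frontier])


# Two hypothetical Pareto-optimal grammars with joint probabilities 0.6 and 0.2
# and held-out likelihoods 0.5 and 0.1.
frontier = [(math.log(0.6), math.log(0.5)), (math.log(0.2), math.log(0.1))]
p_test = math.exp(predictive_log_prob(frontier))
```

The normalized weights are 0.75 and 0.25, so the approximate predictive probability is 0.75 × 0.5 + 0.25 × 0.1 = 0.4.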

Synthesizing a metatheory

At a high level, inference of the cross-language fragment grammar works by maximizing a variational-particle50 lower bound on the joint probability of the metatheory M and the D data sets, \({\{{{{{{{\bf{X}}}}}}}^{d}\}}_{d=1}^{D}\):

$$\log P\left({{{{{\bf{M}}}}}},{\{{{{{{{\bf{X}}}}}}}^{d}\}}_{d=1}^{D}\right)\ge \log P({{{{{\bf{M}}}}}})+\mathop{\sum }\limits_{d=1}^{D}\log \mathop{\sum}\limits_{\begin{array}{c}\langle {{{{{{\bf{T}}}}}}}_{d},{{{{{{\bf{L}}}}}}}_{d}\rangle \in \\ {{{{{\rm{support}}}}}}[{Q}_{d}(\cdot )]\end{array}}P\left({{{{{{\bf{X}}}}}}}^{d}|{{{{{{\bf{T}}}}}}}_{d},{{{{{{\bf{L}}}}}}}_{d}\right)P\left({{{{{{\bf{T}}}}}}}_{d},{{{{{{\bf{L}}}}}}}_{d}|{{{{{\bf{M}}}}}}\right)$$

where this bound is written in terms of a set of variational approximate posteriors, \({\left\{{Q}_{d}\right\}}_{d=1}^{D}\), whose support we constrain to be small, which ensures that the above objective is tractable. We alternate maximization with respect to M (i.e., inferring a fragment grammar from the theories in the supports of \({\left\{{Q}_{d}\right\}}_{d=1}^{D}\)), and maximization with respect to \({\left\{{Q}_{d}\right\}}_{d=1}^{D}\) (i.e., finding a small set of theories for each data set that are likely under the current M). Our lower bound increases most when the support of each \({Q}_{d}\) coincides with the top-k most likely theories, so at each round of optimization, we ask the program synthesizer to find the top k theories maximizing \(P({{\bf{X}}}^{d}|{{\bf{T}}}_{d},{{\bf{L}}}_{d})P({{\bf{T}}}_{d},{{\bf{L}}}_{d}|{\bf{M}})\). In practice, we find the top k = 100 theories for each data set.
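The alternation can be illustrated with a deliberately simplified stand-in for the metatheory. In the sketch below, M is just a categorical distribution over a handful of hypothetical rule-schema names rather than a fragment grammar, candidate theories and their log-likelihoods are supplied by hand rather than found by a synthesizer, and re-estimation uses smoothed counts with a pseudocount of 1. All names and numbers are assumptions made for illustration.

```python
import math
from collections import Counter

SCHEMAS = ["devoice", "harmony", "delete", "insert"]  # hypothetical schema names


def log_prior(theory, m):
    """log P(T | M), assuming rules are generated independently."""
    return sum(math.log(m[rule]) for rule in theory)


def top_k(candidates, m, k):
    """Keep the k candidates maximizing log P(X | T) + log P(T | M)."""
    return sorted(candidates, key=lambda c: c[1] + log_prior(c[0], m),
                  reverse=True)[:k]


def reestimate(supports, pseudocount=1.0):
    """Refit M from the rules in every support, with additive smoothing."""
    counts = Counter({schema: pseudocount for schema in SCHEMAS})
    for support in supports:
        for theory, _ in support:
            counts.update(theory)
    total = sum(counts.values())
    return {schema: counts[schema] / total for schema in SCHEMAS}


def fit_metatheory(datasets, k=1, rounds=5):
    m = {schema: 1.0 / len(SCHEMAS) for schema in SCHEMAS}
    supports = []
    for _ in range(rounds):
        supports = [top_k(candidates, m, k) for candidates in datasets]
        m = reestimate(supports)
    return m, supports


# Each data set lists (theory, log-likelihood) candidates.
datasets = [
    [(("devoice",), -2.0), (("insert", "delete"), -2.0)],
    [(("devoice",), -3.0), (("delete",), -2.9)],
    [(("devoice", "harmony"), -1.0)],
]
m, supports = fit_metatheory(datasets)
```

In this toy run the second data set initially prefers the `delete` analysis on likelihood alone, but once the other data sets have raised the probability of `devoice` under M, the learner switches to the cross-linguistically common analysis.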

We represent M by adapting the Fragment Grammars formalism43. Concretely, M is a probabilistic context free grammar (PCFG) that stochastically generates phonological rules. More precisely, M generates the syntax tree of a program which implements a phonological rule. In the Fragment Grammars formalism, one first defines a base grammar, which is a context-free grammar. Our base grammar is a context-free grammar over SPE rules (Supplementary Fig. 6). Learning the fragment grammar consists of adding new productions to this base grammar (the “fragments”), while also assigning probabilities to each production rule. Formally, each fragment is a subtree of a derivation of a tree generated from a non-terminal symbol in the base grammar; informally, each fragment is a template for a piece of a tree, and thus acts as a schema for a piece of a phonological rule. Learning a fragment grammar never changes the set of trees (i.e., programs and rules) that can be generated from the grammar. Instead, through a combination of estimating probabilities and defining new productions, it adjusts the probability of different trees. See Supplementary Fig. 6, which shows the symbolic structure of the learned fragment grammar.

This fragment grammar gives us a learned prior over single phonological rules. We define P(T, L | M) by assuming that rules are generated independently and that M does not affect the prior probability of L:

$$P({{{{{\bf{T}}}}}},{{{{{\bf{L}}}}}}|{{{{{\bf{M}}}}}})=P({{{{{\bf{L}}}}}}|{{{{{\rm{UG}}}}}})\mathop{\prod}\limits_{r\in {{{{{\bf{T}}}}}}}P(r|{{{{{\bf{M}}}}}})$$

Our prior over fragment grammars, P(M), follows the original work in this space43: fragments are generated sequentially, with each new fragment stochastically sampled from the current fragment grammar. This encourages shorter fragments, as well as reuse across fragments.

We depart from ref. 43 in our inference algorithm: while ref. 43 uses Markov Chain Monte Carlo methods to stochastically sample from the posterior over fragment grammars, we instead perform hill climbing on the objective in Eq. (13). Each round of hill climbing proposes new fragments by antiunifying subtrees of phonological rules in \({\left\{{{{{{{\bf{T}}}}}}}_{d}\right\}}_{d=1}^{D}\), and re-estimates the continuous parameters of the resulting PCFG using the classic Inside–Outside algorithm51. When running Inside–Outside we place a symmetric Dirichlet prior over the continuous parameters of the PCFG with pseudocounts equal to 1.
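Antiunification, the operation used to propose fragments, computes a template that keeps whatever two trees share and replaces mismatching subtrees with a slot. The sketch below is a simplified version: trees are encoded as nested tuples whose first element is a node label, every mismatch is replaced by the same `?` slot (a full least-general-generalization would track distinct variables), and the two rule encodings are hypothetical rather than the grammar of Supplementary Fig. 6.

```python
def antiunify(t1, t2, hole="?"):
    """Generalize two trees: keep shared structure, put a slot where they differ."""
    if t1 == t2:
        return t1
    same_shape = (
        isinstance(t1, tuple) and isinstance(t2, tuple)
        and len(t1) == len(t2) and t1[0] == t2[0]
    )
    if same_shape:
        children = tuple(antiunify(a, b, hole) for a, b in zip(t1[1:], t2[1:]))
        return (t1[0],) + children
    return hole


# Two hypothetical rule trees: final devoicing and intervocalic spirantization.
devoicing = ("rule", ("focus", "-sonorant"), ("change", "-voice"),
             ("left", "+vowel"), ("right", "#"))
spirantization = ("rule", ("focus", "-sonorant"), ("change", "+continuant"),
                  ("left", "+vowel"), ("right", "+vowel"))

schema = antiunify(devoicing, spirantization)
```

The resulting `schema` keeps the shared focus and left context and leaves slots for the structural change and the right context, which is the kind of reusable template a fragment is meant to capture.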

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.