Multiplex lexical networks reveal patterns in early word acquisition in children

Network models of language have provided a way of linking cognitive processes to language structure. However, current approaches focus only on one linguistic relationship at a time, missing the complex multi-relational nature of language. In this work, we overcome this limitation by modelling the mental lexicon of English-speaking toddlers as a multiplex lexical network, i.e. a multi-layered network where N = 529 words/nodes are connected according to four relationships: (i) free association, (ii) feature sharing, (iii) co-occurrence, and (iv) phonological similarity. We investigate the topology of the resulting multiplex and then proceed to evaluate single layers and the full multiplex structure on their ability to predict empirically observed age of acquisition data of English-speaking toddlers. We find that the multiplex topology is an important proxy of the cognitive processes of acquisition, capable of capturing emergent lexicon structure. In fact, we show that the multiplex structure is fundamentally more powerful than individual layers in predicting the ordering with which words are acquired. Furthermore, multiplex analysis allows for a quantification of distinct phases of lexical acquisition in early learners: while initially all the multiplex layers contribute to word learning, after about month 23 free associations take the lead in driving word acquisition.

In the following text we provide additional details to the main paper. In Section 1 we give further details about the age of acquisition dataset and characterise the normative ordering. Further technical details on network definitions are given in Section 2. In Section 3 we briefly review the use of single layer network measures referenced in the main text. In the same section we also report and discuss a k-core analysis of the individual MLN layers, in order to explore core-periphery structure of the MLN. Section 4 contains an overview of the multiplex measures adopted in the main text. Section 5 provides some additional analytical results on the random word guessing baseline, which was used as a reference for the overlap word measures introduced in the Methods section of the main paper. Section 6 reports the word gains for many additional ordering experiments carried out and only cited in the main text. Orderings are either based on network features or on word characteristics. Section 7 discusses the optimisation results obtained for betweenness centrality and local clustering, where each layer of the MLN has a different influence. We conclude with Section 8, where we test the influence of individual word features on the predictability outcome of the optimisation procedure.

Vocabulary size of children over time
In order to establish an empirical relationship between the number of words learned (i.e. the vocabulary size) and the actual age of children, we considered CDI data collected in a research setting and publicly available [21]. These CDI data exhibit the same trends as the original norming data used in the main text [13] but allow for identification of a child's specific vocabulary size. A linear fitting y = ax + b of the number of words x learned by children of age in months y up to month 30 allowed us to attribute an average age to children of a given vocabulary size (a = 0.0151 ± 0.0006, b = 19.7 ± 0.2, adjusted R-squared R^2 = 0.53). Notice that despite individual points in the dataset ranging between ages of 16 up to 30 months, the dependent variable of the linear fit only ranges between 19 and 28 months (cf. SI Fig. S1). This estimated information was used for the plots in Figures 2 and 3 of the main text.
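The fitted relationship above can be used to attribute an average age to a given vocabulary size. A minimal sketch using the published fit coefficients; the function name `estimated_age` is ours, for illustration:

```python
# Map vocabulary size to an average age in months via the linear fit
# y = a*x + b reported above (a = 0.0151 months/word, b = 19.7 months).
# The coefficients are the published estimates, not re-derived here.
A, B = 0.0151, 19.7

def estimated_age(vocabulary_size):
    """Average age in months attributed to a given vocabulary size."""
    return A * vocabulary_size + B

# E.g. the full 529-word CDI vocabulary maps to about 27.7 months,
# inside the 19-28 month range of the fit's dependent variable.
```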

Edge definitions
Every node in the MLN represents a word and is replicated across layers. The association and feature norm layers are semantic, in that their relationships provide information about word meanings, while the phonological layer is based on similarity patterns across word pronunciations. The layer of co-occurrences in child-directed speech likely contains information related to semantic, syntactic, and phonological similarity. Notice that the order of layers is not important in our MLN framework.
The free association layer is based on the empirical University of South Florida Free Association Norms [33]. The dataset was built over almost 750,000 empirical free association pairs produced by 6,000 participants as responses to 5,019 stimulus words. Participants were asked to indicate the first target word that came to mind which was related to the presented cue word.
Supplementary Figure 1: Linear fitting y = ax + b of the number of words x learned by children of age in months y up to month 30 from the dataset. The overlay indicates standard errors estimated from the fitting procedure.
The free association norms were obtained by considering only statistically significant associations. From these norms directed edges were retrieved from a cue word (e.g. "eat") to a target word (e.g. "food") with a given normalized frequency (e.g. 0.41). Therefore a weighted directed edge from word A to word B in the original dataset means that "A is freely associated with B". When restricted to the overlap with the CDI vocabulary check list, the network contained 529 words. This CDI association network displayed a reciprocity of 0.38 (i.e. 38% of the edges were bidirectional). However, in this preliminary investigation of multiplex lexical networks, we considered only unweighted and undirected network layers for the sake of simplicity. We ignored edge weights and converted links from directed to undirected. Hence, in the MLN an edge between words A and B means that one of them is empirically associated with the other (e.g. "food" could be a response to "eat" or "eat" a response to "food"). This undirected network representation was used by Steyvers et al. [38] and Hills et al. [25] in their modelling of early language growth using network representations. Instead, a directed version was used in Beckage et al. [5].
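The conversion from the directed, weighted norms to the undirected, unweighted layer can be sketched as follows; the example pairs and weights are illustrative, not taken from the USF norms:

```python
# Sketch: converting directed, weighted free-association pairs into the
# undirected, unweighted edge set used in the MLN association layer.
directed = {("eat", "food"): 0.41, ("food", "eat"): 0.12, ("dog", "cat"): 0.30}

# Ignore weights and direction: keep one undirected edge per word pair.
undirected = {frozenset(pair) for pair in directed}

# Reciprocity: fraction of directed edges whose reverse also occurs.
reciprocal = sum(1 for (a, b) in directed if (b, a) in directed)
reciprocity = reciprocal / len(directed)
```

Here "eat"/"food" form a reciprocal pair while "dog" → "cat" does not, so two of the three directed edges are reciprocated.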
The feature norms layer is based on the McRae feature norms dataset [30], i.e. a set of feature norms collected from approximately 725 participants for 541 living (e.g. dog) and nonliving (e.g. chair) basic (noun) concepts. Participants were asked to list semantic features of each concept, capturing the most salient and most relevant features. In the semantic features layer generated from this feature norming study, words A and B are connected if they share at least X = 1 semantic feature (e.g. "balloon" and "ball" share the SHAPE feature).
The co-occurrence layer is based on the CHILDES dataset [29] and it considers word co-occurrences in child-directed speech. In the co-occurrence layer an edge exists between words A and B if word A occurs within five words before or after word B. Because of the high rate of spurious connections, the co-occurrence between two words must have a frequency higher than a given threshold C to be considered as an edge in our graph. We set C = 45 in order for the co-occurrence layer to display a connectivity as close as possible to that of the association and feature norms layers. As an example, the words "be" and "back" co-occur more than 45 times in the dataset and are therefore connected in the co-occurrence layer of the MLN. From a cognitive perspective, spurious connections are expected in child co-occurrence data because sentences from or directed to children tend to be short and because topics tend to change often [29].
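The windowed co-occurrence construction can be sketched as below, with a toy corpus and a toy threshold standing in for the full CHILDES data and C = 45:

```python
# Sketch: build co-occurrence edges from tokenised child-directed speech.
# Two words are linked if they co-occur within a 5-word window more than
# C times; the corpus and the threshold here are illustrative only.
from collections import Counter

WINDOW, C = 5, 2  # the paper uses C = 45 on the full CHILDES corpus

def cooccurrence_edges(sentences, window=WINDOW, threshold=C):
    counts = Counter()
    for tokens in sentences:
        for i, w in enumerate(tokens):
            for v in tokens[i + 1 : i + 1 + window]:
                if w != v:
                    counts[frozenset((w, v))] += 1
    return {pair for pair, n in counts.items() if n > threshold}

corpus = [["be", "back", "soon"], ["be", "back", "now"], ["be", "back"], ["go", "back"]]
edges = cooccurrence_edges(corpus)  # only "be"-"back" exceeds the threshold
```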
The phonological layer of the MLN is based on phonological similarity among words. We computed similarities based on the IPA phonological word transcriptions obtained from WordNet 3.0 [31]. An edge in the phonological layer between words A and B means that the IPA phonological transcriptions of these words have edit distance one, in agreement with the definition of phonological word similarity adopted in other studies [39]. For instance, the words "bad" and "bat" are phonologically similar, as they differ in their last phoneme, and are therefore connected in the phonological layer.
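The edit-distance-one criterion can be sketched with a standard Levenshtein computation; plain strings stand in here for the IPA transcriptions:

```python
# Sketch: link words whose transcriptions are at edit distance one
# (one substitution, insertion, or deletion).
def edit_distance(s, t):
    # Standard dynamic-programming Levenshtein distance.
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        curr = [i]
        for j, ct in enumerate(t, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (cs != ct)))
        prev = curr
    return prev[-1]

def phonologically_similar(s, t):
    return edit_distance(s, t) == 1

# "bad" and "bat" differ by one substitution and would be connected.
```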
Single-layer network metrics for linguistic networks

All single layers, as defined above, have isolated nodes and small connected components, i.e. in each layer some pairs of nodes exist that are not connected by paths. This is evident also from the example visualisation in panel (a) of Fig. 1 from the main text. On the other hand, the whole multiplex lexical network is connected. While the multiplex literature admits several definitions of multiplex connectedness [40,28,8], we here adopt the notion of connectedness from De Domenico et al. [19], where a multiplex network is considered connected if its aggregate is connected. As reported in Table 1 of the main text, the largest connected component of the MLN's aggregate network includes all 529 words: the full multiplex lexical network is therefore considered connected.
Notice that the MLN framework considers edge-coloured graphs as multiplex networks following the definition originally proposed within the social sciences [40,4], where no explicit inter-layer connections are considered. This is a specific case of multiplex networks, which can be generalised in order to consider inter-layer connections as well [28,8,18]. In the present MLN representation we ignore costs of jumping across layers, hence no inter-layer edges are considered. This modelling assumption leads to shortest paths over the multiplex structure coinciding with shortest paths over the aggregate network [19]. Thus a shortest path is the network path connecting any two nodes within the smallest number of hops [34].
Shortest path lengths determine, together with clustering, the so-called small-world feature in single-layer networks [41]. We follow established definitions of small-worldness from the literature about cognitive networks [38,39,7] according to which a network is a small-world if, compared to random graphs of the same size, it exhibits a significantly higher mean clustering coefficient and a comparable mean shortest path length. In the field of linguistic networks, the small-world feature has been found in semantic networks [32,35,26,38] and phonological networks [39,36] as well. As evident from the comparison to configuration models in Table 1 from the main text all the layers of the MLN display the small-world feature. As also suggested in previous literature [9,6,7,22], we conjecture that small-worldness may be cognitively beneficial to language learning and use, as it might allow for efficient navigation within semantic memory [22,9,3]. Furthermore, empirical evidence has shown that small-worldness is related to language learning in children [6]: semantic network lexicons of late talkers, who are likely to exhibit language processing difficulties, do show small-worldness to a much smaller degree compared to lexicons from children learning words at normative pace.
Small-worlds are defined through the average shortest path length and the clustering coefficient. It is worth recalling that the mean clustering coefficient CC ∈ [0, 1] measures how much neighbourhoods resemble complete graphs. The local clustering coefficient for node i with k_i neighbours and Λ_i edges between its neighbours is:

CC_i = 2 Λ_i / (k_i (k_i - 1)).

As evident from the formula, disconnected nodes and nodes with only one neighbour have an ill-defined local clustering coefficient. In order to exclude them from the mean local clustering CC = (Σ_i CC_i)/N, we use the deformed clustering coefficient CC_c [27] (see Table 1 from the main text):

CC_c = CC / (1 - θ),

where θ is the disconnectedness ratio [27], i.e. the fraction of nodes of degree 0 and 1 in the network, which by definition cannot contribute to clustering. Previous work [27] highlighted the importance of considering the deformed clustering coefficient when evaluating small-worldness, especially in small networks (i.e. networks composed of a few hundred nodes). This is the motivation for using the deformed clustering coefficient in this work. The other network measures reported in Table 1 in the main text are: (i) the coefficient of assortative mixing by degree a, measuring degree correlations across network edges, (ii) the fraction of nodes within the largest connected component (Conn.) and (iii) the average shortest path length ⟨d⟩. Several independent studies identified single-layer linguistic networks as being significantly either more assortative [39,36,5] or more disassortative [5] when compared to random graphs: the assortativity of mixing by degree thus quantifies aspects of structure in language networks that are not present in random graphs. From a cognitive perspective, assortative (disassortative) behaviour indicates the tendency for words having larger numbers of associates/meanings/phonological neighbours (not) to be connected with each other.
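The clustering definitions above can be sketched in a few lines of pure Python; we assume here that the deformed coefficient rescales the raw mean by the fraction 1 - θ of nodes that can actually contribute (degree ≥ 2), which matches the exclusion described in the text but should be checked against [27] for the original formulation:

```python
# Local clustering CC_i = 2*L_i / (k_i*(k_i - 1)) and a deformed mean
# clustering coefficient. Assumption: CC_c = CC / (1 - theta), i.e. the
# mean is rescaled by the fraction of contributing nodes; see [27].
def local_clustering(adj, i):
    neigh = adj[i]
    k = len(neigh)
    if k < 2:
        return 0.0  # ill-defined for degree 0 or 1; counted as 0 in the raw mean
    links = sum(1 for u in neigh for v in neigh if u < v and v in adj[u])
    return 2.0 * links / (k * (k - 1))

def deformed_clustering(adj):
    n = len(adj)
    theta = sum(1 for i in adj if len(adj[i]) < 2) / n  # disconnectedness ratio
    mean_cc = sum(local_clustering(adj, i) for i in adj) / n
    return mean_cc / (1.0 - theta) if theta < 1 else 0.0

# Toy graph: a triangle plus an isolated node (theta = 1/4).
adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2}, 4: set()}
```

On the toy graph the raw mean clustering is 3/4 (the isolated node drags it down), while the deformed coefficient recovers the value 1 of the triangle alone.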
In the phonological layer this pattern is the consequence of word length correlating with network degree [36,37]. We conjecture that assortative or disassortative mixing by degree in the semantic MLN layers might be related to navigability through words in the mental lexicon. A higher than random assortativity coefficient would express the possibility of reaching a well connected word directly from another well connected word. Instead, disassortative degree mixing, as observed in the association layer, would indicate the possibility of directly reaching specific, less connected terms from broader, better connected terms.
For a more detailed review of the above network metrics we refer the interested reader to [34]. Individual network layers are compared against configuration models, i.e. random graphs with the same degree sequence (and hence the same degree distribution) as the empirical networks. This choice of null model is appropriate, as it rules out effects of the degree sequence on network properties, which is important since some of the empirical networks have heavy-tailed degree distributions (see SI Fig. 2). Configuration models were sampled by randomising edges while maintaining degree sequences [34]. The resulting network properties, such as the clustering coefficient, were tested against analytical estimates [34]. All comparisons of MLN layers in the main text are relative to these configuration models.
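Degree-preserving randomisation of this kind can be sketched via repeated double edge swaps, a common way to sample configuration-model graphs with a fixed degree sequence; the graph and helper names here are illustrative:

```python
# Sketch: sample a degree-preserving null graph by double edge swaps.
# Each accepted swap rewires (a,b),(c,d) -> (a,d),(c,b), which leaves
# every node's degree unchanged.
import random

def double_edge_swap(edges, n_swaps, seed=0):
    rng = random.Random(seed)
    edges = [tuple(e) for e in edges]
    edge_set = {frozenset(e) for e in edges}
    done, attempts = 0, 0
    while done < n_swaps and attempts < 100 * n_swaps:
        attempts += 1
        i, j = rng.sample(range(len(edges)), 2)
        (a, b), (c, d) = edges[i], edges[j]
        # Skip proposals that would create a self-loop or a multi-edge.
        if len({a, b, c, d}) < 4:
            continue
        if frozenset((a, d)) in edge_set or frozenset((c, b)) in edge_set:
            continue
        edge_set -= {frozenset((a, b)), frozenset((c, d))}
        edge_set |= {frozenset((a, d)), frozenset((c, b))}
        edges[i], edges[j] = (a, d), (c, b)
        done += 1
    return edges

def degree_sequence(edges, nodes):
    deg = {u: 0 for u in nodes}
    for a, b in edges:
        deg[a] += 1
        deg[b] += 1
    return sorted(deg.values())

cycle = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0)]
randomised = double_edge_swap(cycle, 5)  # same degree sequence as the cycle
```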

k-core analysis
The high deformed clustering coefficients [27], the relatively high assortativity coefficients, and the relatively flat cumulative degree distributions for degrees up to six (see SI Fig. 2) all suggest the presence of a core-periphery structure in the MLN layers. Notice that words in a densely connected core might have topological features significantly different from those of words in a poorly connected network periphery. Since we relate topological word features to word learning, it is useful to better assess the presence of strongly connected cores within the individual layers of the MLN. For this purpose we perform a k-core analysis [34]. A k-core is a maximal subset of nodes such that each node in the set is connected to at least k others in the same subset. The "maximal" feature indicates that a group of nodes is a k-core only if it is not a subset of any larger group that is a k-core. In practice, k-cores can be obtained by repeatedly deleting all nodes of degree less than k in a given network.
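The deletion procedure just described can be sketched directly; the toy graph below is illustrative:

```python
# Sketch: k-core by iterated deletion of nodes of degree < k.
def k_core(adj, k):
    adj = {u: set(vs) for u, vs in adj.items()}  # work on a copy
    while True:
        low = [u for u, vs in adj.items() if len(vs) < k]
        if not low:
            return set(adj)  # surviving nodes form the k-core (may be empty)
        for u in low:
            for v in adj.pop(u):
                if v in adj:
                    adj[v].discard(u)

# Toy graph: a 4-clique {1,2,3,4} with a pendant node 5 attached to 4.
adj = {1: {2, 3, 4}, 2: {1, 3, 4}, 3: {1, 2, 4}, 4: {1, 2, 3, 5}, 5: {4}}
```

On this graph the 3-core is the 4-clique (the pendant node is pruned), and there is no 4-core.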
In Supplementary Table S1 we list the number of k-cores, while the size of the largest k-core for each layer and the aggregated network are visualised in Supplementary Fig. S3. For instance, the phonological network has one largest connected component and 32 other connected components (i.e. k-cores with k = 1); these smaller components are also called linguistic islands [39]. A high number of linguistic islands, and no highly connected core, seems typical for phonological networks [39,36]. The association layer features a 6-core but no higher k-core. This is further indication that this layer is structurally different from the others, in that it does not display a marked core-periphery structure. In fact, from Table 1 of the main text, the association layer has a bigger largest connected component but comparable average connectivity when compared to the other MLN layers. On the other hand, the feature norms and the co-occurrence layers display k-cores up to k = 20, implying the presence of a very densely connected core (i.e. a marked core-periphery structure) in each of these two layers. Notice that the core of the co-occurrence layer is smaller than the core of the feature norms layer and it also shrinks at a faster rate as k increases. The presence of a very densely connected core in the feature sharing layer is also indicated by the cumulative degree distribution reported in SI Fig. 2. In fact, for k < 20 the probability P(X ≥ k) of finding nodes with degree equal to or higher than k stays nearly at its original value: this indicates the rarity of nodes having degrees 1 ≤ k ≤ 20 in the network. Further analysis indicates that the feature norms layer displays up to a 25-core made up of 78 words. The presence of highly connected k-cores is evidence for a rich club effect [34], i.e. highly connected nodes tend to share links among themselves, particularly in the feature sharing layer.
This result is in agreement with the high assortative mixing by degree exhibited by this layer and reported in Table 1 of the main text. The aggregated network also displays k-cores, up to k = 26, and these are considerably larger than those of the individual layers, again suggesting a rich club effect when the whole edge-coloured multiplex structure is taken into account.
Interestingly, words belonging to densely connected k-cores are acquired earlier in the word trajectory based on the empirical age of acquisition. For instance, the 90 words in the 26-core of the aggregate network are ranked 30 ± 1 positions above the average position of ensembles of 90 randomly selected words in the empirical age of acquisition word ranking. Hence, when the normative age of acquisition ordering is considered, words in the densely connected core of the MLN structure are learned earlier than expected at random. This finding corroborates the idea of an interplay between topological features of words in the MLN and patterns in word acquisition. Further, since words in a densely connected core tend to be learned earlier but in general also display higher degree and higher closeness centrality [34], we focus on these two network features in the main text when exploring word acquisition patterns through ordering experiments.
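The comparison between the average rank of core words and that of random word ensembles can be sketched as follows, with toy data in place of the CDI ranking and the 26-core:

```python
# Sketch: how much earlier a word set sits in an acquisition ranking than
# randomly drawn sets of the same size. Ranking and "core" are toy data.
import random

def mean_rank(words, ranking):
    pos = {w: r for r, w in enumerate(ranking, start=1)}
    return sum(pos[w] for w in words) / len(words)

def random_baseline(ranking, size, samples=2000, seed=0):
    rng = random.Random(seed)
    draws = [mean_rank(rng.sample(ranking, size), ranking) for _ in range(samples)]
    return sum(draws) / samples

ranking = [f"w{i}" for i in range(100)]  # normative ordering, earliest first
core = set(ranking[:10])                 # a set of words learned early
# Positive advantage: core words sit above the random expectation.
advantage = random_baseline(ranking, 10) - mean_rank(core, ranking)
```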

Multiplex network metrics
Multiplex networks are convenient for considering multiple interactions within the same representation [40,16,15]. Nonetheless, it is important to quantify whether considering all these interactions jointly is necessary or redundant in terms of topological patterns and interpretation of results. Structural reducibility [16] investigates whether the multiplex paradigm is a suitable model, identifying the presence and extent of redundant topological patterns. In order to assess the benefit of the multiplex representation, we adopt the greedy procedure suggested in [16] and implemented in muxViz [17]. This structural reducibility analysis relies on: (i) identifying topologically similar layers, (ii) aggregating layers if appropriate, and (iii) comparing the richness in topological patterns of the aggregated multiplex layers against the aggregated network, obtained by projecting all the edges in the multi-layer structure on a single-layer network.
The procedure identifies similarity of layers and quantifies how distinguishable the multiplex is from aggregate versions of two or more layers. The whole procedure is based on the Von Neumann entropy of each multiplex layer [16]. This measure is used for determining the richness of topological patterns of each multiplex network aggregation. Let us consider a multiplex network with M layers. In a given aggregation stage some of the layers might be aggregated, so that the multiplex network might count X ≤ M layers. A relative entropy function q of an aggregation stage where X multiplex layers are distinct is computed as:

q = 1 - (1 / (X h_A)) Σ_{α=1}^{X} h_α,

where h_α is the Von Neumann entropy of layer α while h_A is the Von Neumann entropy of the aggregated network. The higher the relative entropy q, the more distinguishable the multiplex at the given aggregation stage is compared to the aggregate network. The maximum value of q identifies the aggregation stage that is the most distinguishable compared to the aggregate network. In Fig. 1 (e) of the main text we show that for the MLN the maximum of q is reached when all the layers are kept separate and no aggregation takes place. This means that the MLN is irreducible, i.e. aggregating any of its layers decreases the information on the topological patterns encapsulated in the multiplex structure. For the mathematical details behind the structural reducibility analysis we refer to [16].
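The entropy-based comparison can be sketched numerically; we assume here the standard spectral definition of the Von Neumann entropy of a graph, h = -Σ_i λ_i log λ_i over the eigenvalues of the trace-rescaled Laplacian, and the quality function in the form given above (see [16] for the full formulation):

```python
# Sketch: Von Neumann entropy of a layer and the relative entropy q
# comparing X distinct layers with their aggregate.
import numpy as np

def von_neumann_entropy(adj_matrix):
    """h = -sum_i lambda_i * log(lambda_i) over the rescaled Laplacian spectrum."""
    a = np.asarray(adj_matrix, dtype=float)
    lap = np.diag(a.sum(axis=1)) - a
    lam = np.linalg.eigvalsh(lap / lap.trace())  # eigenvalues sum to 1
    lam = lam[lam > 1e-12]                       # drop (numerically) zero modes
    return float(-(lam * np.log(lam)).sum())

def relative_entropy(layers, aggregate):
    """q = 1 - sum_a h_a / (X * h_A) for X distinct layers."""
    h_a = von_neumann_entropy(aggregate)
    return 1.0 - sum(von_neumann_entropy(l) for l in layers) / (len(layers) * h_a)

# Two identical layers add no information over their aggregate: q = 0.
triangle = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]
```

Aggregating duplicated layers therefore loses nothing, while q > 0 signals layers whose edge patterns genuinely differ.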
In the performed ranking experiments we adopt single-layer network measures such as degree, closeness and betweenness centralities and their multiplex counterparts. Here we give a brief technical overview of these measures. The multidegree [18] or overlapping degree [8] m_i of node i is defined as the sum of all the degrees of the node replicas across the M = 4 layers of the multiplex:

m_i = Σ_{α=1}^{M} k_i^(α),

where k_i^(α) is the degree of node i on layer α. The multidegree provides partial information about the connectivity of a node across the multiplex structure and it is one of the simplest and most frequently used measures in the relevant literature [8,18,4].
In single-layer networks, the closeness centrality c_i measures how close node i is to all other nodes in the network:

c_i = Σ_{j≠i} 1 / d_ij,

where d_ij is the shortest-path distance between nodes i and j. Pairs of disconnected nodes have distance infinity and thus do not contribute to the sum. Closeness relates to how fast information is expected to spread from a given node to others in the network. In the main text we consider the generalisation of closeness centrality where multiplex shortest paths are considered (i.e. we allow for "jumps" across layers in order to quantify how close a node is to another one). Notice that this multiplex version of closeness centrality corresponds to the closeness centrality on the aggregate network. In a more general setting one could consider costs for jumps between layers and might wish to introduce more general closeness centrality measures on multiplex networks (cf. [19]). Our choice of identifying multiplex closeness with the closeness of the aggregate is motivated by the simplicity of the modelling approach and the difficulty of identifying costs of transitioning between layers. The weighting of transitions between layers may be related to cognitive functioning, but it remains a methodologically open question in the literature [1,42]. Notice that closeness centrality is a problematic measure on disconnected networks [34]. In our case closeness centrality is defined on a connected multiplex network, so that biases due to disconnectedness are not present when all the layers are considered at once. We also consider multiplex shortest paths in the multiplex version of the betweenness centrality b_i, which quantifies the extent to which a node lies on the shortest paths between other nodes. For a mathematical definition we refer to [34]. In SI Sect. 6 and 7 we report on the performance of word betweenness for predicting word learning.
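Both the multidegree and the closeness just defined can be sketched in a few lines; closeness is computed here in the harmonic form described above (unreachable nodes contribute nothing), on toy layers and their aggregate:

```python
# Sketch: multidegree as the sum of layer degrees, and harmonic closeness
# on the aggregate network (disconnected pairs contribute 0 to the sum).
from collections import deque

def multidegree(layers, node):
    # m_i = sum over layers alpha of the layer degree k_i^(alpha).
    return sum(len(adj.get(node, ())) for adj in layers)

def harmonic_closeness(adj, i):
    # BFS distances from i; unreachable nodes add nothing to the sum.
    dist, queue = {i: 0}, deque([i])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return sum(1.0 / d for j, d in dist.items() if j != i)

# Two toy layers on the same node set and their aggregate (edge union).
layer1 = {1: {2}, 2: {1}, 3: set()}
layer2 = {1: {3}, 2: set(), 3: {1}}
aggregate = {1: {2, 3}, 2: {1}, 3: {1}}
```

Note how node 1 has degree 1 in each layer but multidegree 2, and how the aggregate connects nodes 2 and 3 even though no single layer does.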
Another measure tested in our ordering experiments is the PageRank centrality, which represents the likelihood that a random walker arrives at any particular node while navigating a given network by hopping through links. Versatile PageRank [20] was recently introduced as a multiplex generalisation of PageRank, where random walkers can also transition across layers. For the mathematical details behind the formulation of the measure, we refer to [18] and [20]. In SI Sect. 6 we report on the performance of PageRank for predicting word learning.

Random word guessing
To evaluate predictions of word acquisition we compare against the reference model of randomly guessing the words to be learned, i.e. the baseline of random permutations of the empirical data set for the word acquisition trajectory. As each word has the same probability of occurring at each position, the probability that a word is correctly guessed as having been learned by time t is p_w = t/N. Hence, the average number of correctly guessed words at time t is ⟨n_t⟩ = t^2/N. Notice that the number of randomly guessed words ⟨n_t⟩ approaches the total number of words for larger t, i.e. the word gain measure (see Methods from the main text) becomes less sensitive at later learning stages. The standard deviation σ_t for random word guessing is:

σ_t = sqrt( t p_w (1 - p_w) ).

σ_t is one of the three sources of variation affecting ordering experiments presented in the Results section in the main text. The second source is related to shuffling words tied in ranking positions and the third is related to the probabilistic reshuffling of words learned at each month according to normative CDI data. These sources of variation led to error margins on the vocabulary normalised word gains of the order of magnitude of ≈ 10^-3, which are roughly the size of dots reported in Figure 2 from the main text and thus not displayed in that plot for clarity.
Notice that the number of randomly guessed words ⟨n_t⟩ represents the expected overlap of random orderings with the normative age of acquisition orderings. In the main text we use ⟨n_t⟩ as a reference value for defining both the vocabulary normalised word gain (see the Methods section) and the Z-scores of the overlaps observed through our ordering experiments. We approximate the binomial distribution relative to ⟨n_t⟩ and σ_t according to the commonly used constraint that p_w t and (1 - p_w) t both have to be greater than 5 [24]. This implies that we can approximate ⟨n_t⟩ and σ_t as coming from a Gaussian distribution when at least t = 60 words have been acquired. When valid, this approximation allows us to use Z-score statistical testing at a significance level of 5% when at least 60 words have been acquired. The black dashed line in Fig. 2 (b) from the main text marks this region and we adopt it also for indicating the end of the VELS.
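The baseline statistics and the validity check can be sketched as follows (function names ours, for illustration):

```python
# Sketch: random-guessing baseline for word-acquisition predictions.
# Expected overlap <n_t> = t^2/N, its standard deviation, the normal-
# approximation check (p_w*t and (1-p_w)*t both > 5), and the Z-score.
import math

def baseline(t, n_words):
    p_w = t / n_words
    expected = t * p_w  # <n_t> = t^2 / N
    sigma = math.sqrt(t * p_w * (1.0 - p_w))
    gaussian_ok = p_w * t > 5 and (1.0 - p_w) * t > 5
    return expected, sigma, gaussian_ok

def z_score(observed_overlap, t, n_words):
    expected, sigma, _ = baseline(t, n_words)
    return (observed_overlap - expected) / sigma

# With N = 529 and t = 60, both conditions hold (p_w * t ~ 6.8 > 5).
```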

Tested word orderings and word gains
In the main text we only reported detailed results for word rankings based on a subset of all measures explored. Additional details are given in Supplementary Fig. S4 for word gains for orderings based on degree, word frequency, closeness, word length, betweenness and PageRank. Average word gains are reported on the right end of the plot for each ordering. Notice that the highest word gain is achieved by the ordering based on the association degree, which correctly predicts an average of about 19 words as learned on top of the expectation from random guessing. The maximum word gain obtained by the association degree is 42 ± 1 words.
As reported in the main text, adding degrees across different layers together does not guarantee better word gains. In fact the multidegree reaches a lower maximum word gain compared to the association degree only (34 ± 1 vs 42 ± 1). Error bounds are estimated over different normative age of acquisition orderings.
In the main text we tested the ordering based on word frequency computed from the child-directed speech data of the CHILDES dataset [29]. However, we also computed word frequency on an additional, independently obtained dataset, namely the Opensubtitle dataset [2], which is based on TV subtitles. As reported in Supplementary Fig. S4, the Freq. (Adults) ordering based on Opensubtitle performs much worse (average word gain of 0 ± 2) compared to the frequency counting of the CHILDES dataset (average word gain of 17 ± 2). We conjecture this large gap in performance is due to the different nature of the two frequency datasets. While the CHILDES dataset reflects speech from and directed to children, the Opensubtitle one is based mainly on TV series written by and targeted at adults. Unsurprisingly, the lack of predictive power of the adult frequencies suggests that frequency counting from adults is a poor estimator of the word learning dynamics in children.
Word betweenness of the individual layers performs worse than degree on all the MLN layers. This is not surprising, as in all but the association layer several nodes are disconnected and hence have betweenness 0. However, when inter-layer paths are considered, the betweenness centrality computed on the multiplex structure performs noticeably better, with a maximum word gain of 39 ± 2. This finding is in agreement with the structural reducibility analysis from the main text: it supports the importance of considering the whole multiplex structure when investigating patterns among words in the modelled mental lexicon.
Single-layer PageRank for the association layer gives performance close to the degree. This is not a surprise, as PageRank and degree are correlated with each other [34]. However, unlike the degree, the multiplex version of PageRank for the whole MLN provides predictive power similar to the best single-layer counterpart (i.e. the PageRank in the association layer). This indicates that random walks on the whole MLN are similar to random walks on the association layer, while hub nodes in the association layer might be very different from hub nodes in other layers, so that multidegree and degree in the association layer differ substantially. From a cognitive perspective, the relatively good performance of PageRank confirms previous results about the importance of this measure in exploring the mental lexicon [23].
The word gain was used in the main text for defining the vocabulary normalised word gains (see the Methods section from the main text). A vocabulary normalised word gain of X has to be interpreted as the fraction of words in the whole vocabulary correctly guessed as learned by a given ordering, in addition to those that would be guessed at random. For instance, if the vocabulary is made of 100 words and an ordering correctly identifies 24 of them while 4 would be guessed at random, then the vocabulary normalised word gain is 0.2, or 20%. Supplementary Tab. S2 reports the average relative word gains for many of the tested orderings.
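The worked example above can be sketched as (function name ours, for illustration):

```python
# Vocabulary normalised word gain: fraction of the vocabulary correctly
# predicted as learned beyond the random-guessing expectation <n_t> = t^2/N.
def normalised_word_gain(correct, t, n_words):
    expected_random = t * t / n_words
    return (correct - expected_random) / n_words

# Example from the text: N = 100, t = 20 words learned; an ordering
# correctly identifies 24 while t^2/N = 4 would be guessed at random,
# giving a gain of (24 - 4) / 100 = 0.2, i.e. 20%.
```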

Ordering Experiments for the Phonological Layer
Supplementary Fig. S4 and Tab. S2 both demonstrate that the phonological layer does not perform well in terms of predicting the order in which words are learned. In Supplementary Fig. S5 we report the word gain Z-scores (see Methods from the main text) for the two estimators that work best in our analysis, namely word degree and closeness. When we consider the phonological layer as isolated from the other MLN layers and rank the words according to their degree (closeness) in the phonological layer, we obtain orderings whose word gains are within 2 standard deviations of random fluctuations over almost the whole learning trajectory. Hence, within a confidence interval of 95% we can say that degree (closeness) contributions coming from the phonological layer are compatible with random fluctuations. This is why the phonological layer does not perform well in the ordering experiments and in the optimisation experiments as well (see main text).
From a cognitive perspective our technical finding suggests that young toddlers do not use phonological similarities for boosting the likelihood of learning specific new words. This result is supported by recent empirical evidence with children of 18 months of age [14]. As discussed also in the main text, different measures for capturing how similar sounding words are processed by children are required. Almost all the retrieved Z-scores are compatible with overlaps provided by random guessing, falling within the range -1.96 ≤ Z ≤ 1.96. Notice that after roughly 300 words are learned the two orderings provide the same average performance because the phonological layer features 268 disconnected words, i.e. words having degree 0 and hence carrying no topological information for the ranking. When ranked, those words become a tie and are randomised, so that the ranking on that tie is identical to random guessing. We believe this is the technical reason why the phonological layer does not perform well in the optimisation procedure as well.

An Expanded Phonological Layer
Previous literature has found indications that phonological neighbourhood size (i.e. degree in a phonological network) has an impact on word acquisition in young toddlers up to 30 months of age [10]. We refer the interested reader to [10] also for a brief review of other works correlating phonology and lexical development in older children.
Based on the optimisation experiments reported in the main text, we cannot interpret the lack of predictive power of the particular phonological layer used in the MLN as a general lack of influence of phonological similarities on lexical acquisition. Instead, this is an indication that the phonological network induced from the 529 words considered in the study is too small (and thus too fragmented) to provide meaningful information about word acquisition. This finding is not in contrast with the irreducibility analysis: the phonological layer does encapsulate edge patterns different from those of the other layers. Nonetheless, further ordering experiments (cf. the previous SI subsection) reveal that these patterns are not more predictive of lexical acquisition than random guessing. In order to test whether the poorly connected topology of the phonological layer is the source of this lack of predictive power, we carried out the additional experiments explained below.
We took inspiration from the work of [10], where the authors used a phonological network of 12000 words from adult caregivers to test the influence of phonology on toddlers. Similarly, we considered the larger phonological network of adult native English speakers, already analysed in [36] and including almost 30000 English words; a detailed analysis of its topology is provided in [36]. The phonological layer of the multiplex analysed in the main paper is a subset of this network. This allows us to evaluate the phonological degrees and closeness values of the same 529 words from the MLN on the "extended" topology of the adults' phonological similarities. Based on these two scores we performed additional ordering experiments, which are reported in SI Fig. S6.
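A minimal sketch of this evaluation, under toy data (the real extended network of [36] has almost 30000 words): degrees of MLN words are computed on the larger adult topology, so neighbours outside the toddler vocabulary also count.

```python
# Toy "extended" adult phonological network: neighbours are words one
# phoneme apart (hypothetical entries, not the data of [36]).
extended = {
    "cat": {"hat", "bat", "cap", "cut"},
    "hat": {"cat", "bat", "hit"},
    "dog": {"dig", "log"},
    # ...plus many adult-lexicon words absent from the toddler MLN
}

# Hypothetical subset playing the role of the 529 MLN words:
mln_words = ["cat", "dog"]

# Degree on the extended topology counts ALL neighbours, including words
# outside the MLN vocabulary, so it is far less fragmented than the
# degree computed on the 529-word layer alone.
extended_degree = {w: len(extended[w]) for w in mln_words}
assert extended_degree == {"cat": 4, "dog": 2}
```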
We find that the word gains obtained by using either degree or closeness on the extended phonological layer are statistically significant at a 97.5% confidence level during ELS. This indicates that phonology does indeed play a role in word acquisition rather early on, consistent with previous findings from the literature [10]. However, this effect reduces over time, and word gains become compatible with random word guessing after 400 words have been learned. A similar result was found in [10], where the influence of phonological word degree on word acquisition vanished around the 30th month of age.
The presence of a statistically significant influence of phonology over word learning motivated us to use the word centralities in the extended phonological network for our optimisation experiments. Results are reported both in the main text (cf. Figure 3) and in SI Sect. 7. Additional analysis relative to the original phonological layer is reported in SI Sect. 8.2.

Percentage Word Gains
To accompany the results from Fig. 2 of the main text, we also introduce the percentage word gain P_O(τ, ω, t), defined as the difference of the word overlaps O_aoa(τ, t) and O_aoa(ω, t) of orderings τ and ω, respectively, normalised by O_aoa(ω, t). We always consider overlaps with the normative age of acquisition orderings. In formulas:

P_O(τ, ω, t) = [O_aoa(τ, t) - O_aoa(ω, t)] / O_aoa(ω, t).

Notice that when ω is a random ordering, then O_aoa(ω, t) = ⟨n_t⟩ at time t; the choice of ω is otherwise general. The percentage word gain quantifies how many more words a given ordering correctly predicts in comparison to the reference ordering ω. For instance, P_O(τ, ω, 100) = 50% indicates that when 100 words have been learned, the ordering τ correctly guesses 50% more words as acquired than the ordering ω does. Results for the case when ω is a random ordering are reported in Supplementary Fig. S7. Error bars are estimated from the standard deviations of the respective orderings. In VELS and ELS, except for the first point (i.e. when 20 words have been acquired), all our tested orderings display a percentage word gain that is several standard deviations away from the reference value of 0 (Sign Test p-value < 10^-5 for all the orderings). The highest percentage word gain is provided by multiplex closeness centrality, which predicts up to 160% more words compared to random guessing. A percentage word gain compatible with zero would instead imply improvements compatible with random fluctuations. Hence, all the word gains observed by ranking words according to network features or frequency are statistically incompatible with random fluctuations once at least 40 words are learned. This finding supports the Z-score results presented in the main text.
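A minimal sketch of how the overlap and the percentage word gain could be computed (function and word names are illustrative, not the paper's code):

```python
# O_aoa counts how many of the first t words of an ordering are among
# the first t normatively acquired words.
def overlap(ordering, normative, t):
    return len(set(ordering[:t]) & set(normative[:t]))

def percentage_word_gain(tau, omega, normative, t):
    """P_O(tau, omega, t): relative overlap gain of tau over omega, in %."""
    o_tau = overlap(tau, normative, t)
    o_omega = overlap(omega, normative, t)
    return 100.0 * (o_tau - o_omega) / o_omega

# Toy normative ordering and two candidate orderings (hypothetical words):
normative = ["mommy", "ball", "dog", "cat", "milk", "shoe"]
tau = ["mommy", "dog", "ball", "shoe", "cat", "milk"]     # a good guess
omega = ["cat", "milk", "mommy", "shoe", "dog", "ball"]   # a poorer guess

assert overlap(tau, normative, 3) == 3
assert overlap(omega, normative, 3) == 1
# tau correctly guesses 200% more words than omega after 3 words:
assert percentage_word_gain(tau, omega, normative, 3) == 200.0
```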
Notice that while P_O allows for a statistical assessment of the significance of word gains in VELS (something that Z-scores cannot do), the percentage word gain P_O suffers from approaching the 0 reference level for all the orderings in the middle of LLS. In this phase the Z-scores provide a clearer picture of the predictability of different orderings (see Fig. 2 in the main text) and a more robust statistical test of our findings.
We also consider the case when ω is the association degree ordering; results are reported in Supplementary Fig. S8. The multiplex closeness performs up to 25% better than association degree in ELS, and the difference in performance between these two orderings is several standard deviations away from 0 (Sign Test p-value < 10^-5 for all the orderings). This unmatched predictive power of closeness centrality is what marks the whole early learning stage. At later stages, during LLS, the multiplex closeness and the association degree perform similarly. Interestingly, frequency performs worse than either multiplex closeness or association degree during the early learning stages. Word length and multiplex PageRank perform only up to 10% better than association degree in ELS. The percentage word gain confirms the claim from the main text that multiplex closeness outmatches all the other considered orderings during the early learning stages.

Supplementary Figure S7 (caption): Percentage word gains against random guessing for the orderings of Figure 2 of the main text. A percentage word gain of 100% at a given learning stage means that a word ordering leads to guessing 100% more words compared to random guessing. Error bars are based on standard deviations. All the orderings in VELS provide percentage word gains that are clearly away from 0 (Sign Test p-value < 10^-5 for all the orderings).

Supplementary Figure S8 (caption): Percentage word gains against the association degree ordering for the orderings of Figure 3 of the main text. A percentage word gain of 20% at a given learning stage means that a word ordering leads to guessing 20% more words compared to association degree. Error bars are based on standard deviations and are the same size as the dots. The peak in ELS for the multiplex closeness is clearly incompatible with a 0 difference (Sign Test p-value < 10^-5 for all the orderings).

Influence of taxonomic relationships on word gains
Many English words refer to categories that are taxonomically organised, e.g. "horse" is a type of "animal". This taxonomic organisation results in basic, super-ordinate and sub-ordinate level object categories, with super-ordinate categories having broader semantic fields than sub-ordinate ones. The taxonomic organisation can influence the semantic relationships giving rise to the free associations, feature sharing norms and co-occurrences that we represented in our multiplex lexical network. Hence, explicitly quantifying how word features in individual multiplex layers correlate with a super-ordinate/sub-ordinate categorisation of words in children's vocabulary might be relevant for understanding the mechanisms behind word acquisition in young children.
We focus here on quantifying how the taxonomic word organisation acts as a mediator variable correlating with the two most predictive word features in our framework: the degree of words in the free association layer and the closeness of words on the multiplex structure.
We retrieved from WordNet 3.0 (the version curated by Wolfram Research) a network of hyponymy relationships, where nodes represent words and a directed edge A → B means that A is a type of B. This hyponymy network includes roughly 350000 English words. The in-degree of a node is defined as the number of directed edges pointing to that node; in the hyponymy network, words with higher in-degree represent broader words, as more words fall into their semantic category. For instance, given the edges "pigeon → bird" and "dove → bird", "bird" would have in-degree 2 while "pigeon" and "dove" would have in-degree 0; we would then consider "bird" a broader term than either "pigeon" or "dove".
We assume that in-degree on the whole hyponymy network is a good proxy for the mediator variable that distinguishes super-ordinate words (high in-degree) from sub-ordinate words (low in-degree) in the English language.
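The in-degree computation can be sketched as follows, on toy edges reproducing the pigeon/dove example (the real WordNet-derived network has roughly 350000 words):

```python
# Directed hyponymy edges (a, b) meaning "a is a type of b" (toy data):
edges = [("pigeon", "bird"), ("dove", "bird"),
         ("bird", "animal"), ("dog", "animal"), ("horse", "animal")]

# Count, for every word, how many edges point to it.
indegree = {}
for a, b in edges:
    indegree.setdefault(a, 0)              # sources start at in-degree 0
    indegree[b] = indegree.get(b, 0) + 1   # each incoming edge adds 1

# Broader (more super-ordinate) terms accumulate higher in-degree:
assert indegree["animal"] == 3
assert indegree["bird"] == 2
assert indegree["pigeon"] == 0
```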
We then correlated the multiplex closeness and the association degree of MLN words with their in-degree in the whole hyponymy network, using the Kendall Tau to quantify correlations. Results indicate that the MLN association degree correlates with in-degree almost three times more strongly than multiplex closeness does (Kendall Tau of association degree ≈ 0.20, p-value < 10^-5, vs Kendall Tau of multiplex closeness ≈ 0.07, p-value < 0.01).
The different Kendall Taus indicate that the association degree is a better proxy than closeness centrality for detecting super-ordinate words (i.e. words with higher in-degree in our case).
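For concreteness, a minimal, tie-free sketch of the Kendall Tau (the simple tau-a variant, without the tie corrections a production analysis would use), on hypothetical scores:

```python
from itertools import combinations

def kendall_tau(x, y):
    """Kendall tau-a between two equally long score lists (no tie handling)."""
    n = len(x)
    conc = disc = 0
    for i, j in combinations(range(n), 2):
        s = (x[i] - x[j]) * (y[i] - y[j])
        if s > 0:
            conc += 1       # pair ordered the same way in both lists
        elif s < 0:
            disc += 1       # pair ordered oppositely
    return (conc - disc) / (n * (n - 1) / 2)

# Hypothetical scores for five words: hyponymy in-degree vs association degree
indeg = [0, 1, 2, 3, 4]
assoc = [1, 0, 2, 4, 3]   # mostly concordant with indeg

assert kendall_tau(indeg, indeg) == 1.0
assert kendall_tau(indeg, assoc) == 0.6
```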
As discussed in the main text, we conjecture that the association degree does not perform well in VELS and ELS because it tends to guess more super-ordinate words than multiplex closeness does. Norming studies [25] suggest that children tend to learn basic level categories before super- or sub-ordinate classes (e.g. "dog" is learned before "animal", "chair" before "furniture"), a fact that is not captured well by the association degree ordering. This is compatible with what we observe in our optimisation experiments (see Fig. 3, panels (c) and (d)), where the association layer contributes the most to optimised orderings using both degree and closeness, but only outside of VELS. This network pattern suggests the emergence of a boosting effect in learning the generalisations of words previously acquired during VELS and ELS. These generalisations represent super-ordinate words that are captured predominantly by the association layer and not by the others (otherwise we would expect all the layers to keep the same influence after VELS and ELS).

Optimisation Experiments
In Supplementary Tab. S3 we report the average vocabulary normalised word gains for the optimal orderings obtained by combining degrees and closeness centralities, and in Supplementary Fig. S9 we report the corresponding word gain Z-scores. These results complement the visualisation in panels (a) and (b) of Fig. 3 in the main text. In combination they demonstrate that not even the optimal combination of word degrees can perform better than multiplex closeness in terms of word predictability. Also, differently from the ordering experiments, the Z-scores of the optimised trajectories at the end of VELS are now clearly statistically significant. Ternary plots in panels (a) and (b) of SI Fig. S10 visualise the optimisation landscape of vocabulary normalised word gains over the remaining three layers at the end of VELS, the middle of ELS, and the middle of LLS. The region of optimal word gains tends to narrow and changes its location in the landscape over time, corroborating the idea that different types of linguistic information play different roles at different stages of development.
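The layer combination underlying these optimisations can be sketched as a weighted sum of single-layer word scores (hypothetical layer names, scores and weights; the actual procedure searches for the best weights at each learning stage):

```python
# Combine per-layer word scores into one multiplex score per word.
def combined_scores(layers, weights):
    """Weighted linear combination of single-layer scores (sketch)."""
    words = next(iter(layers.values())).keys()
    return {w: sum(weights[name] * layers[name][w] for name in layers)
            for w in words}

# Toy data: two layers, two words, weights summing to 1.
layers = {"assoc": {"dog": 3, "cat": 1},
          "phono": {"dog": 0, "cat": 2}}
weights = {"assoc": 0.75, "phono": 0.25}

scores = combined_scores(layers, weights)
assert scores == {"dog": 2.25, "cat": 1.25}
```

Words would then be ranked by the combined score and compared against the normative acquisition ordering, exactly as in the single-layer ordering experiments.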
Very similar results to those of SI Fig. S9 were obtained for the MLN with the extended phonological layer, for which the relative word gains are reported in SI Tab. S3 while the optimisation results are reported in SI Fig. S11. In panel (a), the peak in ELS for the multiplex closeness is clearly larger than the peak for optimal combinations of word degrees (Sign Test p-value < 10^-5). Optimisation results relative to these ternary plots are reported in Figure 3 of the main text.

Linear optimisation of betweenness and local clustering
Since betweenness on the whole MLN performs slightly worse than association degree, as reported in SI Sect. 6, we also used the betweenness centralities of words within the MLN layers as a basis for scores to calculate optimal combinations of layers. We considered local deformed clustering as well, as defined in SI Sect. 3. We chose these two additional network features because they utilise the multiplex MLN structure at global and local levels.
When betweenness was used, the retrieved optimal layer importances did not change over the VELS, ELS or LLS stages; therefore only one relative word gain curve is reported in Supplementary Tab. S3. Predictability results coming from the optimal linear combination of betweenness centralities are overall inferior to the ordering results with multiplex closeness and with the single-layer association degree. Betweenness optimisation leads to results similar to the ordering experiments with multiplex betweenness (see Supplementary Tab. S2). The fact that betweenness provides lower predictive power than closeness on the whole MLN suggests that early learned words have higher closeness but lower betweenness: they are close to other words on the MLN structure but are not necessarily part of many shortest paths, as the MLN topology offers different shortcuts for navigating through words.
In the main text we reported the results of optimisation procedures based (i) on a local network statistic, namely degree, and (ii) on a global network statistic, namely closeness centrality. Between these two extremes, we also investigated the optimisation of a second-order local network statistic, the local clustering coefficient. Predictability results are reported in Supplementary Tab. S3. The optimisation over local clustering performs worse than the empirical multiplex closeness centrality and the linear optimisation results for both degree and closeness. This further suggests that the multiplex structure is supportive in capturing the normative language learning trajectories of young children.
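For reference, the single-layer local clustering coefficient can be sketched as follows on toy data (the multiplex "deformed" version of SI Sect. 3 generalises this quantity across layers):

```python
# Local clustering: fraction of a node's neighbour pairs that are linked.
def local_clustering(adj, v):
    """C(v) = 2 * links_among_neighbours / (k * (k - 1)), 0 if k < 2."""
    nbrs = adj[v]
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(1 for u in nbrs for w in nbrs if u < w and w in adj[u])
    return 2.0 * links / (k * (k - 1))

# Toy undirected network: "a" has neighbours b, c, d; only b-c are linked.
adj = {"a": {"b", "c", "d"}, "b": {"a", "c"}, "c": {"a", "b"}, "d": {"a"}}

# C(a) = 2 * 1 / (3 * 2) = 1/3; a degree-1 node has clustering 0.
assert abs(local_clustering(adj, "a") - 1 / 3) < 1e-12
assert local_clustering(adj, "d") == 0.0
```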
The inferior predictability performances of local clustering and betweenness, particularly in ELS, support the idea reported in the main text that closeness centrality, and therefore network distances, capture word correlations that are fundamental and relevant to normative word learning in young children. The idea that network distances are relevant to the cognitive processes regulating the mental lexicon is supported by previous experimental evidence [12,11,32].

Non-linear optimisations
Non-linear combinations of word scores across layers were also explored; in particular, we tested convex combinations of the single-layer scores.

Optimisation on shuffled multiplex networks

In this section we test how correlations of closeness and degree centralities among the layers of the MLN influence the results of the optimisation procedure. In particular, we aim to test if prediction accuracies of the same quality as reported in the main text can be obtained from multiplex networks in which these correlations have been destroyed. We consider two null models obtained by shuffling the multiplex network: (i) a Global Label-Shuffled (GLS) null model, in which word labels are reallocated according to a single common random permutation applied to all layers, and (ii) an Independent Label-Shuffled (ILS) null model, in which word labels are reallocated independently at random on each layer. It is important to note the following:

• The Global Label-Shuffled (GLS) null model preserves the underlying distribution of global scores {S_i} at a global network level, hence the name. However, it does not preserve the local multi-degree or multi-closeness centralities of a given node (i.e. on the microscopic level). This model is a suitable null model because, in addition to the above global preservation of the scores, it also preserves inter-layer degree/closeness correlations (i.e. a hub in one layer might still tend to be a poorly connected node in another layer).
• The Independent Label-Shuffled (ILS) null model does not preserve the underlying distribution of global scores, since it reallocates node labels independently at random on each layer. It preserves only the distributions of the intra-layer scores {s_i^(α)} within each layer α.

Using these null multiplex models, we estimate the word gains. Results from Supplementary Tab. S4 show that:

1. Even when the global distributions of scores are preserved, the optimisation procedure does not lead to results close to the optimal results for the non-shuffled multiplex network; the two sets of results are separated by at least a factor of six.
2. Results for the ILS null model are compatible with a random ordering when error bars are considered: maintaining the distributions of scores in each individual layer is not enough to guarantee better results than random word guessing.
3. Preserving the inter-layer multiplex correlations (as in the GLS null model) is fundamental to the better performance of the optimisation procedure.
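The two label-shuffling null models can be sketched as follows, assuming each layer is stored as a word-to-score dictionary (a minimal illustration, not the paper's actual code):

```python
import random

def global_label_shuffle(layers, rng):
    """GLS: one common random permutation of word labels across ALL layers,
    so each word's vector of scores survives intact under a new label."""
    words = sorted(next(iter(layers.values())))
    perm = words[:]
    rng.shuffle(perm)
    relabel = dict(zip(words, perm))
    return {name: {relabel[w]: s for w, s in layer.items()}
            for name, layer in layers.items()}

def independent_label_shuffle(layers, rng):
    """ILS: an independent random permutation on each layer, destroying
    inter-layer correlations but keeping each intra-layer distribution."""
    out = {}
    for name, layer in layers.items():
        words = sorted(layer)
        perm = words[:]
        rng.shuffle(perm)
        out[name] = {p: layer[w] for w, p in zip(words, perm)}
    return out

# Toy two-layer multiplex scores for three words:
layers = {"assoc": {"a": 3, "b": 1, "c": 2},
          "phono": {"a": 2, "b": 0, "c": 1}}
rng = random.Random(0)

gls = global_label_shuffle(layers, rng)
# GLS preserves the inter-layer score pairs (a joint hub stays a joint hub):
assert {(gls["assoc"][w], gls["phono"][w]) for w in gls["assoc"]} \
    == {(3, 2), (1, 0), (2, 1)}

ils = independent_label_shuffle(layers, rng)
# ILS preserves only each layer's score distribution:
assert sorted(ils["assoc"].values()) == [1, 2, 3]
assert sorted(ils["phono"].values()) == [0, 1, 2]
```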
All in all, predictions on shuffled multiplex networks are substantially less accurate than the predictions reported in the main paper. This gives further support to the main findings of the paper, i.e. that correlations across the multiplex layers are an important determinant of word acquisition.

Importance testing for the phonological layer
Previous literature has reported correlations between degrees in the phonological layer and word learning in young children (cf. [10]). In contrast, the main paper does not find a significant influence of phonological information on word acquisition. The purpose of this section is to demonstrate that, in principle, our methodology can discover and appropriately quantify such correlations if they are present in the dataset. To demonstrate this, we re-allocate word labels on the phonological layer in such a way that phonological degrees correlate with acquisition orderings to a tunable extent. This can be achieved by reshuffling randomly selected words in a directed manner until a desired correlation level has been reached. Then we run optimisations for a multiplex composed of the unaltered semantic layers and the reordered phonological layer. We randomise over 20 probabilistic age of acquisition orderings and perform 10 Monte Carlo robustness samplings for each age of acquisition ordering. We consider three hypothetical scenarios: a small correlation between phonological degree and age of acquisition (Kendall Tau ≈ 0.1), an intermediate correlation (Kendall Tau ≈ 0.3) and a large correlation (Kendall Tau ≈ 0.8), cf. Supplementary Fig. 12. In agreement with expectations, we observe that the optimisation assigns: (i) small but non-negligible weight to the phonological layer in the small correlation scenario, (ii) significant weight to the phonological layer for intermediate strength correlations and (iii) dominant weight to the phonological layer in the large correlation scenario. As shown in the main paper, the association layer also correlates with intermediate strength with age of acquisition (Kendall Tau ≈ 0.24). Hence, in particular in the intermediate correlation scenario, in which the correlations of the phonological and of the association layer with age of acquisition are of similar magnitude, we observe interactions between these two layers, i.e. the phonological layer first dominates and then loses relative weight as the association layer gains influence.
This effect is also detectable for the large correlation scenario, but here correlations in the phonological layer are always dominant.
These experiments clearly demonstrate that our methodology is able to attribute layer weightings according to the predictive power of the layers for word acquisition. In particular, the methodology is in principle capable of quantifying correlations in the phonological layer as well; that is, the observation of phonological layer weights of the order of 10^-3 is not to be attributed to the methodology but rather to the lack of predictive power encapsulated within the network statistics of that layer. Compared to the extended phonological layer we used in SI Sect. 6.1.1, the original smaller layer is much less connected and more fragmented. Both layers are based on the same phonological similarity measure, but the extended phonological layer includes more than 29000 words, while the one originally used in the MLN includes only 529 words. We therefore consider the lack of performance of the MLN phonological layer to be a matter of quantity (rather than quality).
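One way to implement the directed reshuffling towards a target correlation is a greedy swap procedure (our illustrative choice; the paper's exact scheme may differ): repeatedly swap two randomly chosen labels and keep the swap only if the Kendall Tau moves closer to the target value.

```python
import random
from itertools import combinations

def kendall_tau(x, y):
    """Kendall tau-a between two equally long score lists (no tie handling)."""
    n = len(x)
    conc = disc = 0
    for i, j in combinations(range(n), 2):
        s = (x[i] - x[j]) * (y[i] - y[j])
        if s > 0:
            conc += 1
        elif s < 0:
            disc += 1
    return (conc - disc) / (n * (n - 1) / 2)

def shuffle_towards(scores, reference, target_tau, rng, tol=0.05, max_steps=20000):
    """Directed reshuffle: greedily swap entries until the correlation with
    `reference` is within `tol` of `target_tau` (or the step budget ends)."""
    s = scores[:]
    best = abs(kendall_tau(s, reference) - target_tau)
    for _ in range(max_steps):
        if best < tol:
            break
        i, j = rng.sample(range(len(s)), 2)
        s[i], s[j] = s[j], s[i]
        d = abs(kendall_tau(s, reference) - target_tau)
        if d < best:
            best = d                   # keep the improving swap
        else:
            s[i], s[j] = s[j], s[i]    # undo a non-improving swap
    return s

rng = random.Random(1)
aoa_ranks = list(range(30))            # hypothetical age-of-acquisition ranks
degrees = list(range(30))
rng.shuffle(degrees)                   # start from an (almost) uncorrelated state

shuffled = shuffle_towards(degrees, aoa_ranks, 0.3, rng)
assert abs(kendall_tau(shuffled, aoa_ranks) - 0.3) < 0.05
```

The same routine with target values around 0.1, 0.3 and 0.8 reproduces the three correlation scenarios used in this section.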
Supplementary Figure 12: Optimal layer weights obtained from the degree optimisation where the labels on the phonological layer are reshuffled in order for word degrees to correlate with the age of acquisition ordering according to a Kendall Tau of 0.1 (top left), 0.3 (top right) and 0.8 (bottom).