Exploratory dynamics of vocal foraging during infant-caregiver communication

We investigated the hypothesis that infants search in an acoustic space for vocalisations that elicit adult utterances and vice versa, inspired by research on animal and human foraging. Infant-worn recorders were used to collect day-long audio recordings, and infant speech-related and adult vocalisation onsets and offsets were automatically identified. We examined vocalisation-to-vocalisation steps, focusing on inter-vocalisation time intervals and distances in an acoustic space defined by mean pitch and mean amplitude, measured from the child’s perspective. Infant inter-vocalisation intervals were shorter immediately following a vocal response from an adult. Adult intervals were shorter following an infant response and adult inter-vocalisation pitch differences were smaller following the receipt of a vocal response from the infant. These findings are consistent with the hypothesis that infants and caregivers are foraging vocally for social input. Increasing infant age was associated with changes in inter-vocalisation step sizes for both infants and adults, and we found associations between response likelihood and acoustic characteristics. Future work is needed to determine the impact of different labelling methods and of automatic labelling errors on the results. The study represents a novel application of foraging theory, demonstrating how infant behaviour and infant-caregiver interaction can be characterised as foraging processes.

Table S1: Red squares are for the complete dataset, automatically (LENA) labelled. Each data point corresponds to one day-long recording. Error bars represent the standard deviation**. The dark blue points (HUM in the legend) represent the same information as computed from data labelled by human listeners, while the cyan points (LENA in the legend) are the values obtained from the corresponding data as labelled by the LENA system. Note that data from infant 340 at 183 days was re-labelled by two human listeners and results from both listeners are represented. (c, d) Mean standardised amplitude of vocalisations as a function of infant age for infants and adults, respectively.
*In all our analyses, acoustic measures (amplitude and log pitch) are unitless since they have been standardised.
**We computed the unbiased sample standard deviation, dividing by N-1, where N is the number of samples used in computing the standard deviation. This is in contrast to the population standard deviation, which divides by N.
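
The distinction drawn in the footnote can be illustrated with a minimal Python sketch. The sample values are made up for illustration and are not taken from the dataset; the variable names are likewise hypothetical.

```python
import statistics

# Hypothetical standardised values from one recording (illustrative only).
samples = [0.2, -0.5, 1.1, 0.4, -0.9]

# Unbiased sample standard deviation: sum of squared deviations divided by N - 1.
sample_sd = statistics.stdev(samples)

# Population standard deviation: sum of squared deviations divided by N.
population_sd = statistics.pstdev(samples)

print(sample_sd, population_sd)
```

For any sample with at least two distinct values the N-1 version is the larger of the two, and the gap shrinks as N grows.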

Measure                                Age effect
Infant mean pitch                      0.11 (p = 0.13)
Infant pitch standard deviation        0.21 (p = 0.01)
Infant mean amplitude                  0.50 (p < 0.001)
Infant amplitude standard deviation    -0.61 (p < 0.001)
Adult mean pitch                       -0.12 (p = 0.13)
Adult pitch standard deviation         0.31 (p < 0.001)
Adult mean amplitude                   0.001 (p = 0.99)
Adult amplitude standard deviation     -0.35 (p < 0.001)

Table S2: Vocalisation acoustics as a function of infant age. βs are shown with p-values in brackets. Both were obtained from linear mixed effects models with participant ID as a random effect and infant age as fixed effect. Statistically significant results (at a significance level of 0.05) are in bold. All values reported have been rounded to two decimal points wherever possible.

Table S3: Vocalisation acoustics as a function of whether the vocalisation was preceded by response and infant age, with optional response-infant age interaction. βs are shown with p-values in brackets. Both were obtained from linear mixed effects models with participant ID as a random effect, and whether the preceding vocalisation received a response and infant age as fixed effects. Interaction between response and infant age was used as an optional fixed effect. Results from the model with the interaction term are given in columns 4, 5, and 6. Statistically significant results (at a significance level of 0.05) are in bold. All values reported have been rounded to two decimal points wherever possible.

Step sizes in 2D acoustic space as a function of steps in time. βs are shown with p-values in brackets. Both were obtained from linear mixed effects models with participant ID as a random effect. Infant age and steps in time were used as fixed effects, with presence or absence of a response and infant age-response interaction as optional fixed effects, to test the effect of receiving a response and/or potential interactions between infant age and response received on how steps in 2D acoustic space correlate with steps in time. Statistically significant results (at a significance level of 0.05) are in bold. All values reported have been rounded to two decimal points wherever possible.

S3. Step size probability density fits
For each set of step sizes (infants' pitch steps following an adult response, infants' pitch steps following no adult response, etc.), we used AIC (see https://github.com/AnneSWarlaumont/infant-vocal-foraging/tree/master/Analyses/AIC_Theo_Rhodes_code for details) to determine the best-fit probability density distribution type and the curve parameters. The types of distributions considered were normal, lognormal, exponential and pareto distributions. For a given type of step, we determined what type of distribution best fit the majority of day-long recordings. We then analysed distribution parameters only from those recordings whose best-fitting distribution was of the majority best-fit type for that set of step size distribution fits.

Figure S3: Representative example of randomly selected data vs. AIC fit (Infant mww, age 75 days). The figure shows the probability distribution of steps in acoustic space for infant vocalisations where the first infant vocalisation was not followed by an adult response. The distribution derived from the data is in blue and the AIC best fit (lognormal, in this case) is shown in red.

Figure S4: Distribution of AIC best fits for step size probability distributions for various step types. Probability distributions of step sizes along pitch (panel a) and amplitude (panel b) dimensions, respectively, are predominantly exponential, for both adults and infants. (c) Probability distributions of steps in two-dimensional acoustic space are predominantly lognormal for both infants and adults. (d) Probability distributions of inter-vocalisation times are predominantly lognormal for infants, and pareto for adults, with the exception that unsplit inter-vocalisation time distributions are largely lognormal for adults.
Sample size was included to control for possible co-variation between sample size (number of vocalisation step events in the recording) and distribution fits, age, and response; including sample size did not affect which results were statistically significant or their sign. Infant ID was a random effect in all models. All values reported have been rounded to two decimal points wherever possible.
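
As a rough illustration of the AIC-based selection described at the start of this section, the sketch below fits the four candidate distributions by maximum likelihood and picks the one with the lowest AIC, then keeps only recordings whose best fit matches the majority type. It is not the authors' code (which is at the repository linked above). The function names, the use of the sample minimum as the pareto xmin, the parameter counts, and the synthetic data are all assumptions made for illustration, and step sizes are assumed to be positive and not all equal.

```python
import math
import random
from collections import Counter

def aic_scores(steps):
    """Return {distribution name: AIC} for maximum-likelihood fits to positive steps."""
    n = len(steps)
    logs = [math.log(x) for x in steps]
    total, log_total = sum(steps), sum(logs)
    scores = {}

    # Normal: ML estimates of mean and variance (2 parameters).
    mu = total / n
    var = sum((x - mu) ** 2 for x in steps) / n
    ll = -0.5 * n * (math.log(2 * math.pi * var) + 1)
    scores["normal"] = 2 * 2 - 2 * ll

    # Lognormal: normal fit to log(steps), plus the Jacobian term -sum(log x).
    mu_l = log_total / n
    var_l = sum((v - mu_l) ** 2 for v in logs) / n
    ll = -log_total - 0.5 * n * (math.log(2 * math.pi * var_l) + 1)
    scores["lognormal"] = 2 * 2 - 2 * ll

    # Exponential: rate lambda = 1 / mean (1 parameter).
    lam = n / total
    ll = n * math.log(lam) - lam * total
    scores["exponential"] = 2 * 1 - 2 * ll

    # Pareto: xmin taken as the sample minimum (an assumption), alpha by ML.
    xmin = min(steps)
    alpha = n / sum(math.log(x / xmin) for x in steps)
    ll = n * math.log(alpha) + n * alpha * math.log(xmin) - (alpha + 1) * log_total
    scores["pareto"] = 2 * 2 - 2 * ll
    return scores

def best_fit(steps):
    """Name of the candidate distribution with the lowest AIC."""
    scores = aic_scores(steps)
    return min(scores, key=scores.get)

def majority_type(recordings):
    """Majority best-fit type, and the recordings whose best fit matches it."""
    best = [best_fit(r) for r in recordings]
    majority = Counter(best).most_common(1)[0][0]
    kept = [r for r, b in zip(recordings, best) if b == majority]
    return majority, kept

# Synthetic stand-in for three recordings' step sizes (not real data).
random.seed(0)
recordings = [[random.lognormvariate(0.0, 0.5) for _ in range(2000)] for _ in range(3)]
majority, kept = majority_type(recordings)
print(majority, len(kept))
```

Filtering on the majority type means parameters such as λ or µ are only compared across recordings that share the same distribution family.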

As shown in Table S6 and Figure S5, we observed a significant decrease with age in the 90th percentile value of infant amplitude step size. We also observed a significant increase with age in the λ parameter of the fitted exponential distributions. Both findings suggest that infants take shorter steps in amplitude as they get older, indicating more focused exploration. For adults' amplitude step sizes, we observed a significant increase in the median and 90th percentile values, as well as smaller λ for the exponential fits following infant response and with increasing infant age. These findings all point to larger steps in amplitude, and thus suggest broader adult exploration, both with infant response and with increasing infant age.
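
The link between λ and step length follows from standard properties of the exponential distribution: the mean is 1/λ and the quantile at probability p is -ln(1-p)/λ. A minimal sketch, with illustrative λ values that are not taken from the fits:

```python
import math

def exponential_quantile(p, lam):
    """Quantile of an exponential distribution with rate lam: -ln(1 - p) / lam."""
    return -math.log(1.0 - p) / lam

# Illustrative rates only; larger lam gives a smaller mean, median and 90th percentile.
for lam in (0.5, 1.0, 2.0):
    median = exponential_quantile(0.5, lam)
    p90 = exponential_quantile(0.9, lam)
    print(f"lambda={lam}: mean={1 / lam:.2f}, median={median:.2f}, 90th percentile={p90:.2f}")
```

So a larger fitted λ and a smaller 90th percentile point in the same direction, towards shorter steps.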
Finally, Figure S6 shows how, for infant and adult vocalisations from the entire dataset (all recordings at all ages combined), the median and 90th percentile values of amplitude steps change as a function of the number of vocalisation events by the speaker since a response was last received. Here, the step from the vocalisation that receives a response to the next vocalisation is designated as vocalisation 0, the following step is designated 1, and so on. This continues until the next response is received, at which point the count resets to 0. Thus, as shown in panels (c) and (d), there are fewer steps included in the analysis as the number of events since last response increases, and thus the estimates in panels (a) and (b) can be expected to be less stable as the number of vocalisations since last response increases. At the moment, we treat these visualisations as exploratory, leaving statistical analyses of such multi-event sequences for future work.

Here, the step from the vocalisation that received a response to the next vocalisation was assigned 0, the following step (assuming no response to the second vocalisation) was designated 1, and so on. This continued until the next response was received, at which point the count resets to 0. The data for infants is in blue and that for adults is in black. (b) 90th percentile values of amplitude step sizes for the entire dataset, as a function of number of vocalisations since the last response was received (Infant, blue; adult, black). Number of vocalisation events as a function of number of vocalisations since the last response was received for infants (c), and adults (d). Note that plots were terminated after the 10th vocalisation step following the post-response step.
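
The indexing of steps by vocalisations since the last response can be sketched as follows. The function name and the choice to label steps that occur before any response with None are assumptions made for illustration.

```python
def steps_since_response(received_response):
    """Label each step (vocalisation i -> i + 1) with the count since the last response.

    received_response[i] is True if vocalisation i received a response.
    Steps that occur before any response has been received are labelled None.
    """
    labels = []
    count = None
    for got_response in received_response[:-1]:  # the last vocalisation starts no step
        if got_response:
            count = 0          # step out of a responded-to vocalisation
        elif count is not None:
            count += 1         # one more vocalisation without a response
        labels.append(count)
    return labels

# Hypothetical sequence of six vocalisations from one speaker.
print(steps_since_response([False, True, False, False, True, False]))
```

Grouping step sizes by these labels gives the per-count medians and 90th percentiles; higher counts contain fewer steps, which is why the estimates there are less stable.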
S3.2 Do pitch step sizes vary with response and infant age?

Table S7: Steps in pitch at the day-long recording level as a function of recent response, infant age, and sample size: results of statistical analyses. βs are shown with p-values in brackets. Statistically significant results (at a significance level of 0.05) are in bold. Fixed effects are in rows and dependent variables are in columns. A separate linear mixed effects regression model was run for each dependent variable. Sample size was included to control for possible co-variation between sample size (number of vocalisation step events in the recording) and distribution fits, age, and response; including sample size did not affect which results were statistically significant or their sign. Infant ID was a random effect in all models. All values reported have been rounded to two decimal points wherever possible.

infants; (f) shows a similar plot for adults (WR, black; WOR, green). Note that only distributions that were determined to best fit to an exponential based on AIC criterion are represented in (a), (d), (e), and (f). (g) 90th percentiles of the infant pitch step size distributions plotted against infant age following adult response (WR, blue) and not (WOR, red); (h) shows a similar plot for adult pitch step sizes (WR, black; WOR, green). Medians and 90th percentile values were computed based on the raw data, prior to determining best fits using AIC.
For infants, as shown in Table S7 and Figure S7, we found a significant increase following adult response in the median pitch step size, suggesting that infants are more likely to take longer steps in the pitch dimension immediately after receiving adult responses.
As infant age increased, median and 90th percentile pitch step sizes increased and λs of the exponential step size probability density fits decreased; these three findings suggest that as infants get older, they explore more broadly in the pitch dimension.
For adults, as shown in Table S7 and Figure S7, we found a significant decrease in the 90th percentile value and a significant increase in the λ parameter of the exponential fits following an infant response. Together, these results suggest that adults are more likely to take shorter steps following a response from an infant, indicating more focused exploration. We also found that median and 90th percentile values increased and λ decreased with infant age. These findings suggest that adults take longer steps in the pitch dimension as infant age increases, suggesting more adult pitch exploration and variation as the infant develops.
Finally, Figure S8 shows the median and 90th percentile values of pitch steps as a function of the number of vocalisation events by the vocaliser since a response was last received, for infant and adult vocalisations from the entire dataset (all recordings at all ages combined).

The step from the vocalisation that received a response was assigned 0, the following step (assuming no response in the meantime) was assigned 1, and so on. This continued until the next response was received, at which point the count reset to 0. Infant data are shown in blue, and adult data are in black. (b) 90th percentile of pitch step sizes for the entire dataset, as a function of number of vocalisations since the last response was received (Infant, blue; adult, black). Note that plots were terminated after the 10th vocalisation event after the last response.

S3.3 Do step sizes in 2D acoustic space vary with response and infant age?
See main text for additional results and discussion.
For a demonstration of how lognormal and pareto distributions change as a function of their parameters, see https://osf.io/2fuje/ (Wolfram Player may be used to view the demo). To see how lognormal and pareto distributions vary with their parameters within the parameter regimes seen in our data (per AIC best fits), see Fig

All values reported have been rounded to two decimal points wherever possible.

Figure S9: Representative probability distributions to demonstrate how values affect fitted step size distribution shapes. (a) has lognormal probability distributions for parameter ranges similar to those observed in AIC fits for infant and adult WR and WOR step size distributions in 2D acoustic space obtained from LENA and human-labelled data. As µ increases, the peak widens while shifting to the right and the tail gets wider, making both intermediate steps and larger steps more likely. As σ increases, the peak widens while shifting to the left, and the tail widens, making shorter and larger steps more likely. (b) A similar plot for lognormal parameter ranges similar to those observed in AIC fits for WR and WOR temporal step size distributions (infants) obtained from LENA and human-labelled data. In contrast with plot (a), plot (b) uses log scales for both x and y axes, in order to better highlight the effects of differing parameter values in the ranges of interest. (c) A similar plot for pareto parameter ranges similar to those observed in AIC fits for adult WR and WOR temporal step size distributions obtained from LENA and human-labelled data. In addition, we add a reference curve at xmin = 100 to show the effect of increasing xmin. For the range of xmin values obtained from AIC best fits (∼ 1-1.3) and the range of step sizes in time present in our data, the change in xmin has no appreciable effect on the pareto distribution. In contrast, however, as α increases, the distribution decays rapidly and the likelihood of larger step sizes decreases.
For parameter ranges for infant and adult WR/WOR lognormal fits of step size distributions in 2D acoustic space from LENA-labelled data, see Fig. S10; for parameter ranges for infant and adult WR/WOR step size distributions in time from LENA-labelled data, see Fig. S12; for parameter values for human-labelled data and the corresponding LENA-labelled subset, see https://osf.io/xptv4/ and https://osf.io/56gx9/. Note that the range of X-axis values used in (a), (b), and (c) corresponds to the range of step size values for the data that went into the analyses reflected in Fig. S10 and S12. All probability distributions shown have been normalised such that the area under the curve from 0 to the maximum X-axis value shown is 1.

The step from the vocalisation that received a response was assigned 0, the following step (assuming no response in the meantime) was assigned 1, and so on. This continued until the next response was received, at which point the count reset to 0. The data for infants are in blue and data for adults are in black. (b) 90th percentile value of step sizes in 2D acoustic space for the entire dataset, as a function of number of vocalisations since the last response was received (infant, blue; adult, black). Note that plots were terminated after the 10th vocalisation event after the last response.
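
The qualitative parameter effects described for Figure S9 can be checked with closed-form expressions for the lognormal and pareto distributions. The sketch below is illustrative only: the function names and the parameter values are assumptions and are not the fitted ranges from our data.

```python
import math

def lognormal_pdf(x, mu, sigma):
    """Lognormal density at x > 0."""
    z = (math.log(x) - mu) / sigma
    return math.exp(-0.5 * z * z) / (x * sigma * math.sqrt(2 * math.pi))

def lognormal_mode(mu, sigma):
    """Location of the lognormal peak: exp(mu - sigma^2)."""
    return math.exp(mu - sigma ** 2)

def pareto_tail(x, xmin, alpha):
    """P(X > x) for a pareto distribution, valid for x >= xmin."""
    return (xmin / x) ** alpha

# Increasing mu moves the peak to the right; increasing sigma moves it to the left.
print(lognormal_mode(0.0, 0.5), lognormal_mode(1.0, 0.5), lognormal_mode(0.0, 1.0))

# Increasing alpha makes large steps less likely.
print(pareto_tail(10.0, 1.0, 1.0), pareto_tail(10.0, 1.0, 2.0))
```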

S3.4 Do inter-vocalisation intervals vary with response and infant age?
See main text for additional figures and results.

Similarly, when sample size was included, the positive relationship between adult median inter-vocalisation interval and an infant response being recently received was only marginally significant. Finally, when sample size was included, we found a statistically significant negative effect in the µ parameter of infant inter-vocalisation step size distributions with respect to having recently received an adult response. All values reported have been rounded to two decimal points wherever possible.

Figure S12: Additional results of inter-vocalisation step size distribution analyses. Infant µ (a) and σ (b) plotted against infant age following adult response (WR, blue) and not (WOR, red). Only distributions that were best fit to lognormal curves based on AIC are represented. Adult xmin (c) and α (d) plotted against infant age following infant response (WR, black) and not (WOR, green). Only distributions that were best fit to pareto curves per AIC are shown. (e) shows 90th percentile values for inter-vocalisation intervals plotted against infant age for infants (WR, blue; WOR, red). (f) shows a similar plot for adults (WR, black; WOR, green). 90th percentile values were computed from the raw data, before finding the AIC best fit.

The step from the vocalisation that received a response to the next vocalisation was assigned 0, the following step (assuming no response in the meantime) was assigned 1, and so on. This continued until the next response was received, at which point the count reset to 0. Data for infants are in blue and adult data are in black. (b) 90th percentile value of inter-vocalisation intervals for the entire dataset, as a function of number of vocalisations since the last response was received. Infant data is in blue and adult data is in black. Note that plots are terminated after the 10th vocalisation event after the last response.
Note that unlike other measures presented in this study (mean and standard deviation of pitch and amplitude, median and 90th percentile values of step sizes in amplitude, pitch, and 2D acoustic space), infant inter-vocalisation intervals are longer than those of adults, based on median and 90th percentile values, i.e., infants vocalise more sparingly than adults. This observation is supported by Table S5. This disparity in median and 90th percentile values is in contrast to all other measures presented, which are comparable for both infants and adults.

S4. Using data re-labelled by human listeners to check the validity of automatically labelled data

1 and 3 is shown, with L1 labels as the known class (row indices) and L3 labels as the predicted class (column indices). For a description of human-listener labels, refer to the Methods section of the main text. Note that we see high agreement between L1 and L3 labels, which is in agreement with the high inter-rater reliability scores for listeners 1 and 3 for data from infant 340 (see Table 3, main text).

S4.2 Acoustic space trajectories and step size distributions of infants and adults
In Figures S14, S15, S16, and S17, we present data from three infants at different ages re-labelled by human listeners. For comparison, we also present the corresponding data (i.e. from the same day-long recording) as labelled by the LENA software. In addition to comparing data labelled by the LENA software with the same data labelled by human listeners, we have one recording (participant 340, age 183 days) labelled by two different human listeners, to show how differences in labelling between listeners affect the data.
In each of the four figures, panel (a) shows infant vocalisations' locations in 2D acoustic space and panel (b) shows adult vocalisations' locations in the same space. The data are depicted as a series of directed vectors from vocalisation i at a location in the acoustic space given by the ordered pair (f_i, d_i) to vocalisation i + 1 at (f_{i+1}, d_{i+1}), starting from the first available vocalisation based on the data. Here, f is the z-scored log pitch, and d is the z-scored amplitude. We see that the without-response (WOR) steps' distributions are extremely similar when LENA's labels vs. human re-labelling are used. On the other hand, for with-response (WR) steps, we see larger differences between the distribution fits (shown in dashed lines in plots e1-e4 and f1-f4) for human-labelled vs. LENA-labelled data. We also see that the raw distributions are much less smooth for human-labelled with-response data. A likely reason for these discrepancies is the paucity of with-response data in the human-labelled dataset (see Table S13). This in turn may be due to human listeners' greater sensitivity to voices, so that segments labelled by humans may have been more likely to be identified as containing multiple human voices and therefore excluded from analysis. Additionally, exclusion of many segments of the audio (based on LENA's automatic segmentation and classification) from the human labelling task made it unlikely that relevant voices that were missed by the LENA system would have been included in the human analysis. For all these reasons, responses were less likely to be identified within the human-labelled datasets. For quantitative results on how step size distributions based on human-labelled data compare to those based on corresponding LENA-labelled data, see Table S14.

Table S13: Number of data points in WR and WOR step sizes for human re-labelled data and corresponding LENA data.
The number of data points for both WR and WOR step sizes for all three datasets that were re-labelled by human listeners are shown. These numbers are reported for adult vocalisations (Ad) and infant vocalisations (Ch), for both human re-labelled data (HUM) and the corresponding LENA data (LENA). The LENA-labelled data include many more steps for all types except child steps in which the first vocalisation was not followed by an adult response; we find that the number of steps for this category is comparable for both human-labelled data and corresponding LENA-labelled data.

Table S14: Two-sample Kolmogorov-Smirnov (KS2) test results, comparing step size distributions from human-labelled data and corresponding LENA-labelled data. frej represents the fraction of tests which failed to reject the null hypothesis (that data from the two samples, i.e. step size distributions from data labelled by LENA and by human listeners, are drawn from the same distribution) at the 0.05 significance level, for individual datasets (column 1). For each category specified in columns 3, 5, and 7, we present the mean p-value from the KS2 test in columns 4, 6, and 8, respectively. The standard deviation of the p-value is given in parentheses. For example, for data from infant 274 at 82 days labelled by listener 1, 83 percent of KS2 tests (performed on unsplit, WR, and WOR step size distributions in pitch, amplitude, 2D acoustic space, and time, for infants and adults) failed to reject the null hypothesis. Further, for all unsplit step size distributions (pitch, amplitude, 2D acoustic space, and time) where the infant was the vocaliser, the mean p-value associated with the KS2 tests performed was 0.37, with a standard deviation of 0.35. By and large, we see that frej is high for all datasets except infant 530 at 95 days labelled by listener 2.
Note that frej is lowest for data from infant 530 at 95 days labelled by listener 2, followed by data from infant 274 at 82 days labelled by listener 1, both of which have the lowest reliability scores (

Table S15: Goodness of AIC fits (human-labelled data and corresponding LENA-labelled subset). The means and standard deviations of the R^2 value of the AIC best fit for different step size distribution types are shown. All results are from data labelled by human listeners and the corresponding LENA-labelled subset. The step size distributions are organised by whether they were computed from data where the vocaliser was an infant or adult (column 1), and whether they are WOR, WR, or unsplit distributions (column 2). For mean and standard deviation for R^2 values for each step size distribution type for a category (e.g. WR pitch step size distributions of adult vocalisations), see https://osf.io/53amv/. For a breakdown of the majority best fit for each distribution type, see Fig. S4. The sixth column has the mean number of observations per distribution for that category while the seventh column has the total number of distributions that went into calculating the mean and standard deviation of R^2 values for that category. For example, the first row of the table gives the mean and standard deviation of all unsplit step size distributions (pitch, amplitude, 2D acoustic space, and time) where the vocaliser was an infant as labelled by human listeners, regardless of best fit type. For this category, each distribution on average had 684 observations, and 16 distributions were used to calculate the mean and standard deviation of R^2 values. All values reported have been rounded to two decimal points wherever possible. R^2 values typically fall between 0 and 1, with values closer to 1 indicating better fits.
Note that one possible reason for lower R^2 values could be that some step types were less prevalent and therefore had fewer steps on which to fit the distribution (see the sixth column of the table).

Raw and fitted probability distributions of step sizes from human-labelled data and corresponding LENA-labelled data are shown in panels (c1) through (f8). Infant data for step size distributions following a response (WR) are in panels c1-c4 and infant data for WOR are in panels c5-c8. Adult WR data are in panels d1-d4, and adult WOR data are in panels d5-d8. Fits for infant WR data are in panels e1-e4 and infant WOR fits are in panels e5-e8. Adult WR fits are in panels f1-f4, and adult WOR fits are in panels f5-f8. The (f) in the legend indicates fitted as opposed to raw data (indicated by (d) in the legend). Note that both human-labelled infant and adult WR and WOR data/fits are given by pink and grey solid/dashed lines and are presented in the same subplots as their corresponding LENA-labelled data/fits, to allow for visual comparison of how well the curves do, or in a few cases do not, overlap.

and as labelled by the LENA software (WR in black and WOR in green; right). Raw and fitted probability distributions of step sizes from human-labelled data and corresponding LENA-labelled data are shown in panels (c1) through (f8). Infant data for step size distributions following a response (WR) are in panels c1-c4 and infant data for WOR are in panels c5-c8. Adult WR data are in panels d1-d4, and adult WOR data are in panels d5-d8. Fits for infant WR data are in panels e1-e4 and infant WOR fits are in panels e5-e8. Adult WR fits are in panels f1-f4, and adult WOR fits are in panels f5-f8. The (f) in the legend indicates fitted as opposed to raw data (indicated by (d) in the legend). Note that both human-labelled infant and adult WR and WOR data/fits are given by pink and grey solid/dashed lines and are presented in the same subplots as their corresponding LENA-labelled data/fits, to allow for visual comparison of how well the curves do, or in a few cases do not, overlap.

Figure S16: Human-labelled data and corresponding LENA-labelled data, Participant 340 at 183 days old; Listener 3. (a) Acoustic space traversed by the infant as labelled by human listener L3 (WR in pink and WOR in grey; left), and as labelled by the LENA software (WR in blue and WOR in red; right). (b) Acoustic space traversed by the adult as labelled by human listener L3 (WR in pink and WOR in grey; left), and as labelled by the LENA software (WR in black and WOR in green; right). Raw and fitted probability distributions of step sizes from human-labelled data and corresponding LENA-labelled data are shown in panels (c1) through (f8). Infant data for step size distributions following a response (WR) are in panels c1-c4 and infant data for WOR are in panels c5-c8. Adult WR data are in panels d1-d4, and adult WOR data are in panels d5-d8. Fits for infant WR data are in panels e1-e4 and infant WOR fits are in panels e5-e8. Adult WR fits are in panels f1-f4, and adult WOR fits are in panels f5-f8. The (f) in the legend indicates fitted as opposed to raw data (indicated by (d) in the legend). Note that both human-labelled infant and adult WR and WOR data/fits are given by pink and grey solid/dashed lines and are presented in the same subplots as their corresponding LENA-labelled data/fits, to allow for visual comparison of how well the curves do, or in a few cases do not, overlap.

Figure S17: Human-labelled data and corresponding LENA-labelled data, Participant 530 at 95 days old; Listener 2. (a) Acoustic space traversed by the infant as labelled by human listener L2 (WR in pink and WOR in grey; left), and as labelled by the LENA software (WR in blue and WOR in red; right). (b) Acoustic space traversed by the adult as labelled by human listener L2 (WR in pink and WOR in grey; left), and as labelled by the LENA software (WR in black and WOR in green; right). Raw and fitted probability distributions of step sizes from human-labelled data and corresponding LENA-labelled data are shown in panels (c1) through (f8). Infant data for step size distributions following a response (WR) are in panels c1-c4 and infant data for WOR are in panels c5-c8. Adult WR data are in panels d1-d4, and adult WOR data are in panels d5-d8. Fits for infant WR data are in panels e1-e4 and infant WOR fits are in panels e5-e8. Adult WR fits are in panels f1-f4, and adult WOR fits are in panels f5-f8. The (f) in the legend indicates fitted as opposed to raw data (indicated by (d) in the legend). Note that both human-labelled infant and adult WR and WOR data/fits are given by pink and grey solid/dashed lines and are presented in the same subplots as their corresponding LENA-labelled data/fits, to allow for visual comparison of how well the curves do, or in a few cases do not, overlap.
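
The comparison summarised in Table S14 can be sketched in two parts: computing step sizes between consecutive vocalisations in the space of z-scored log pitch and z-scored amplitude, and comparing two step size samples with a two-sample Kolmogorov-Smirnov statistic. This is a minimal illustration, not the analysis code. The Euclidean distance, the per-sequence z-scoring with the population standard deviation, the large-sample critical value used in place of an exact p-value, and all function names are assumptions.

```python
import math
from statistics import mean, pstdev

def zscore(values):
    """Standardise a sequence (assumption: per-sequence mean and population SD)."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]

def steps_2d(log_pitch, amplitude):
    """Euclidean step sizes between consecutive vocalisations (f_i, d_i) -> (f_{i+1}, d_{i+1})."""
    f, d = zscore(log_pitch), zscore(amplitude)
    return [math.hypot(f[i + 1] - f[i], d[i + 1] - d[i]) for i in range(len(f) - 1)]

def ks_statistic(a, b):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    a, b = sorted(a), sorted(b)
    n, m = len(a), len(b)
    i = j = 0
    gap = 0.0
    while i < n and j < m:
        x = min(a[i], b[j])
        while i < n and a[i] <= x:
            i += 1
        while j < m and b[j] <= x:
            j += 1
        gap = max(gap, abs(i / n - j / m))
    return gap

def rejects_same_distribution(a, b, level=0.05):
    """Large-sample approximation: reject if D exceeds c(level) * sqrt((n + m) / (n * m))."""
    n, m = len(a), len(b)
    critical = math.sqrt(-0.5 * math.log(level / 2)) * math.sqrt((n + m) / (n * m))
    return ks_statistic(a, b) > critical

# Hypothetical values standing in for two labellings of the same recording.
human_steps = steps_2d([5.1, 5.4, 5.0, 5.6], [60.0, 64.0, 58.0, 66.0])
lena_steps = steps_2d([5.2, 5.3, 5.1, 5.5], [61.0, 63.0, 59.0, 65.0])
print(ks_statistic(human_steps, lena_steps))
```

A test that fails to reject counts towards frej, so a high frej means the human-labelled and LENA-labelled step size distributions were usually not distinguishable at the 0.05 level.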

S5. What vocalisation acoustics and changes in vocalisation acoustics predict responses?

Table S16: Which vocalisation patterns received responses: results of statistical analysis with patterns of change in acoustics included. βs are given with p-values in brackets. Significant results (at a significance level of 0.05) are in bold. The 'step' variables are the step sizes from the preceding vocalisation to the vocalisation in question. Note that acoustic step sizes are non-directional and may represent either increasing or decreasing amplitude or pitch. Infant ID was a random effect in all models. All values reported have been rounded to two decimal points wherever possible.