Acoustic and language-specific sources for phonemic abstraction from speech

Spoken language comprehension requires abstraction of linguistic information from speech, but the interaction between auditory and linguistic processing of speech remains poorly understood. Here, we investigate the nature of this abstraction using neural responses recorded intracranially while participants listened to conversational English speech. Capitalizing on multiple, language-specific patterns where phonological and acoustic information diverge, we demonstrate the causal efficacy of the phoneme as a unit of analysis and dissociate the unique contributions of phonemic and spectrographic information to neural responses. Quantitative higher-order response models also reveal that unique contributions of phonological information are carried in the covariance structure of the stimulus-response relationship. This suggests that linguistic abstraction is shaped by neurobiological mechanisms that involve integration across multiple spectro-temporal features and prior phonological information. These results link speech acoustics to phonology and morphosyntax, substantiating predictions about abstractness in linguistic theory and providing evidence for the acoustic features that support that abstraction.

are in turn significantly longer than coronal taps, each at the p ≤ 0.001 level. However, /d/ and /t/ taps do not differ significantly from one another in duration (p = 0.9). Thus, on the basis of their duration, coronal taps are distinct from coronal stops, yet the taps are not distinct from one another on the basis of their phonemic identity, satisfying the study's foundational assumptions.
Nevertheless, the stimuli used in this study were taken from natural speech and undoubtedly vary along more acoustic dimensions than duration. As a means of further ensuring that the acoustic difference between coronal taps and coronal stops is greater than the difference between the coronal taps themselves, a measure of spectrographic distance was calculated for coronal stops and taps. First, wide-band spectrograms were created for all stimuli using Praat's default settings, and the spectrographic segments corresponding to coronal stops and taps were excised using the segmentation and labelling given in the Buckeye Corpus. The median duration of these segments was calculated (in bins), and all spectrographic segments were resized to this duration using the resize function from the transform module of the Python library scikit-image.10 This function performs bi-linear interpolation to resize the image. Prior to downscaling any images, it applies a Gaussian filter with a kernel size of (s − 1)/2, where s is the downscaling factor, to prevent aliasing artifacts. For each pair of resized spectrograms, the sum of the squared pixel-wise differences was calculated and divided by the size of the spectrogram in pixels. This process is shown schematically in Supplementary Figure 1d. Supplementary Figure 1c plots the distances of [t], [d], and /d/-tap spectrograms from /t/-tap spectrograms (green), as well as the distances for a random split of all coronal taps (yellow). A one-way ANOVA indicated that there was a significant effect of phonemic category on spectrographic distance from /t/-tap [F(3, 339842) = 9124.47, p ≤ 0.001], and post hoc comparisons using the Tukey HSD test indicated that the distance between /t/-tap and /d/-tap is significantly less than the distance between /t/-tap and either of the coronal stops at the p ≤ 0.001 level. However, the mean distance calculated for a random split of taps not based on phonemic identity is not significantly different from the mean distance calculated based on the phonemic identity of the taps (p = 0.44). Once again, this shows that the phonemic identity of coronal taps cannot be determined from the acoustic properties of the taps themselves.
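The distance computation above can be sketched as follows. This is a minimal numpy-only illustration: it linearly interpolates each spectrogram to the median number of time bins (standing in for skimage.transform.resize, and omitting the Gaussian anti-aliasing pre-filter scikit-image applies when downscaling) and then computes the normalized sum of squared pixel-wise differences.

```python
import numpy as np

def resize_time_axis(spec, n_bins):
    """Linearly interpolate a (freq x time) spectrogram to n_bins time bins.
    Stand-in for skimage.transform.resize; no anti-aliasing filter applied."""
    old = np.linspace(0.0, 1.0, spec.shape[1])
    new = np.linspace(0.0, 1.0, n_bins)
    return np.stack([np.interp(new, old, row) for row in spec])

def spectrographic_distance(a, b):
    """Sum of squared pixel-wise differences, divided by the image size."""
    assert a.shape == b.shape
    return np.sum((a - b) ** 2) / a.size

# Toy example: two excised segments of unequal duration, resized to the median.
rng = np.random.default_rng(0)
seg1 = rng.random((64, 20))   # 64 frequency bins x 20 time bins
seg2 = rng.random((64, 30))
median_bins = 25
d = spectrographic_distance(resize_time_axis(seg1, median_bins),
                            resize_time_axis(seg2, median_bins))
```

In the study itself, this distance was computed for every pair of resized spectrograms; the sketch shows only a single pair.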
Although taps derived from /d/ and /t/ are acoustically indistinguishable from one another, the length of the vowel that precedes a tap systematically varies based on the phonemic identity of the tap. When preceding a tap derived from /d/, vowels are approximately 10% longer than those preceding a tap derived from /t/.1,2,9,11 This pattern is also observed in the Buckeye Corpus, where vowels preceding taps derived from /d/ were on average 18.31 ms (SD = [13.23, 23.38]) longer than vowels preceding taps derived from /t/ [t(1309) = 50.16, p ≤ 0.001] (Supplementary Figure 1b). Thus, there does exist an acoustic cue to the phonemic identity of taps. On this basis, one could argue that any observed difference in the neural response to medial /d/ and /t/ is due to the duration of the preceding vowel, seemingly undermining the foundational assumption of the study. For this reason, this study focuses on the 500 ms following the onset of coronal stops and taps. By timelocking the analyzed response to the beginning of closure and using the preceding 100 ms to baseline the signal, the acoustic impact of the preceding vowel is effectively neutralized. Though it remains possible that the duration of the preceding vowel cues the listener to the phonemic identity of the following tap, differences observed in the response to /d/ and /t/ taps must be based on their categorization, and not on the acoustics of the preceding vowel itself, because the mean signal in the 100 ms preceding the tap is subtracted from the analyzed response, and /d/ and /t/ taps themselves do not differ acoustically from one another. In other words, though vowel duration may cue the phonemic identity of the tap, by timelocking the response to the beginning of the tap, any difference in response cannot be reduced to the preceding acoustic context.
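The timelocking-and-baselining step can be made concrete with a short sketch. Assuming a continuous signal sampled at fs Hz and a known tap-onset sample index (the function and its arguments are illustrative, not the paper's analysis code), the analyzed epoch is the 500 ms after onset minus the mean of the 100 ms before it:

```python
import numpy as np

def baseline_corrected_epoch(signal, onset_idx, fs, baseline_ms=100, window_ms=500):
    """Epoch the window following a tap onset and subtract the mean of the
    baseline window immediately preceding it (illustrative helper)."""
    pre = int(fs * baseline_ms / 1000)    # samples in the baseline window
    post = int(fs * window_ms / 1000)     # samples in the analysis window
    baseline = signal[onset_idx - pre:onset_idx].mean()
    return signal[onset_idx:onset_idx + post] - baseline

# Any offset shared by the baseline and the epoch is removed entirely,
# which is why the preceding vowel's contribution is neutralized.
fs = 1000
sig = np.full(2000, 3.5)   # constant signal as a degenerate example
epoch = baseline_corrected_epoch(sig, onset_idx=600, fs=fs)
```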

Supplementary Note 2
Speech responsive electrodes were defined independently for each band. Across the ten subjects, 1,215 (SE ± 28.1) electrodes were found to be speech responsive for at least one band, with an average of 485 electrodes found to be speech responsive per band. While some overlap was observed in the sites categorized as speech responsive across bands, the majority of speech responsive sites were speech responsive for only one band, as can be seen in Supplementary Figure 2. Similarly, significant sites for the tap, regular past tense, and regular plural comparisons were overwhelmingly band-specific (Supplementary Figure 3).

Supplementary Figure 2. Degree of overlap across bands for channels defined as speech responsive. The lower dot matrix plot indicates the combination of bands being considered, and the upper bar plot shows the number of speech responsive channels shared in common for that combination of bands. The leftmost column (black bar) shows the number of electrodes that were not responsive to speech for any band, and color indicates the number of bands in each combination.

Supplementary Note 3
The null distribution for each of the neural response bands was generated from the comparison of random pairs of phones (i.e., A, B) as described in Results Section 'Acoustics, phonology, and morphology drive neural activity' and Methods Section 'Significant Electrodes'. However, gross featural and environmental properties of these randomly chosen pairs differ from those of the linguistically meaningful pairs in ways that could impact the suitability of the generated distribution as a null distribution for assessing the significance of the linguistically meaningful pairs. In this section, we confirm that distributions of randomly chosen pairs of sounds with featural and environmental properties similar to those of the linguistically meaningful pairs are not distinguishable from the overall null distribution. In doing so, we confirm that the distribution generated from random pairs of phones is a suitable null distribution.
The first property that distinguishes the randomly chosen pairs from the linguistically meaningful pairs is that the randomly chosen pairs are on average more featurally different from one another. In other words, the large numbers of significant sites for the linguistically meaningful comparisons could be driven by the phonological similarity of the comparisons /t/ vs /d/ and /s/ vs /z/, each of which differs only in the feature [±voice]. If featural similarity drove the significance of the linguistically meaningful comparisons, we would expect the significance counts for random pairs (A, B) with fewer featural differences to be closer to the outer edge of the null distribution than the counts for random pairs with more featural differences. However, this is not the case. As shown in Supplementary Figure 4, the distribution of significant counts for random pairs with a single feature difference is not meaningfully distinct from the distribution for random pairs with ten feature differences. The second property that distinguishes the randomly chosen pairs from the linguistically meaningful pairs is the consistency of their phonological environments. That is, while all sounds considered in the plural and past tense comparisons were word-final, pairs in the null distribution were not position-restricted. If the consistency of the phonological environment surrounding the phones participating in the tap, plural, and past tense alternations drove the number of significant sites observed for those comparisons, then we would expect comparisons of random phones in consistent phonological environments to be accompanied by similarly large numbers of significant sites. To assess this, we performed the A, Bx, By analysis on 25 pseudo-random pairs of word-initial phones (e.g., all word-initial [z] vs. all word-initial [s]) and 25 pseudo-random pairs of word-final phones (e.g., all word-final [n] vs. all word-final [m]). Again, the distributions for randomly chosen phones with consistent phonological environments fall well within the overall distribution of all randomly chosen phones, as shown in Supplementary Figure 5.
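The notion of featural difference used above can be illustrated with a toy feature matrix. The feature values below are illustrative and cover only a handful of features, not the full feature system cited in the paper:

```python
# Toy phonological feature matrix; 1/0 stand in for +/- feature values.
FEATURES = {
    "t": {"voice": 0, "coronal": 1, "continuant": 0, "nasal": 0},
    "d": {"voice": 1, "coronal": 1, "continuant": 0, "nasal": 0},
    "s": {"voice": 0, "coronal": 1, "continuant": 1, "nasal": 0},
    "z": {"voice": 1, "coronal": 1, "continuant": 1, "nasal": 0},
    "m": {"voice": 1, "coronal": 0, "continuant": 0, "nasal": 1},
}

def feature_distance(a, b):
    """Number of phonological features on which phones a and b differ."""
    return sum(FEATURES[a][f] != FEATURES[b][f] for f in FEATURES[a])
```

Under this toy matrix, feature_distance("t", "d") and feature_distance("s", "z") both equal 1, mirroring the fact that each linguistically meaningful pair differs only in [±voice], while a random pair such as ("t", "m") differs on several features.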

Supplementary Note 4
In the main body of the manuscript, for a site to be considered an acoustic site for the coronal tap comparison, there must have existed at least one time window with a significant difference between surface [t] and tap /t/ tokens and between surface [t] and tap /d/ tokens, but no significant difference between tap /t/ and tap /d/ tokens. For a site to be considered a phonemic site, there must have existed at least one time window with a significant difference between tap /d/ and tap /t/ tokens and between tap /d/ and surface [t] tokens, but no significant difference between surface [t] and tap /t/ tokens. We say that these comparisons have a "/t/ anchor" because they compare two kinds of /t/ with one kind of /d/.
A priori, the /t/ anchor comparison was chosen for analysis because surface realizations of /t/ (e.g., [t], [tʰ]) are generally more acoustically distinct from taps than surface [d], and we wanted it to be particularly unlikely that underlying sites for the tap comparison (those grouping [t]/[tʰ] with /t/-taps) could be explained away as another kind of acoustic similarity. However, the comparisons could also be done with a /d/ anchor. Then, for a site to be considered an acoustic site for the coronal tap comparison, there must have existed at least one time window with a significant difference between surface [d] and tap /d/ tokens and between surface [d] and tap /t/ tokens, but no significant difference between tap /d/ and tap /t/ tokens. For a site to be considered a phonemic site, there must have existed at least one time window with a significant difference between tap /t/ and tap /d/ tokens and between tap /t/ and surface [d] tokens, but no significant difference between surface [d] and tap /d/ tokens. The differences between /t/-anchor and /d/-anchor comparisons for the coronal tap alternation are shown in Supplementary Figure 10.

Supplementary Figure 10. Distributions of significant sites for all neural response bands for nine participants. a Locations of acoustic (pink) and phonemic (blue) sites identified by the coronal tap alternation. b Locations of surface (green) and morphological (orange) sites identified by the regular past tense alternation. c Locations of surface (purple) and morphological (gold) sites identified by the regular plural alternation. All subfigures were created using the Python package nilearn (DOI: 10.5281/zenodo.8397156).
spectrograms that had been compressed by a factor of two using a Generative Adversarial Interpolative Autoencoder (GAIA)13 that had been trained on the original spectrograms.
For each participant, for each of the seven neural response types (six classic frequency bands and broadband LFP), seven LME models were fit. Each model fit the neural response with electrode channel and excerpt speaker as random effects, and with either only spectrographic features, only phonemic label features, or both spectrographic and phonemic label features as fixed effects. Supplementary Figure 11 illustrates how these three base models were constructed. Four additional models were also created, in which the phonemic or spectrographic features had been shuffled either within each excerpt or across the full recording session. Models were then compared within participant and neural response type using the Akaike Information Criterion (AIC),14 and all best-fit models carried 100% of the cumulative model weight and had an AIC score >200 lower than other models.
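A sketch of this model-comparison setup, using statsmodels on synthetic data. The column names, formulas, and single random intercept are assumptions for illustration: statsmodels does not handle crossed random effects (channel and speaker) as directly as lme4, so only channel is used here, and models are fit by maximum likelihood (reml=False) so that their AIC values are comparable.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 400
df = pd.DataFrame({
    "spec": rng.normal(size=n),                      # a compressed spectrogram feature
    "phoneme": rng.choice(["t", "d", "s", "z"], n),  # label, one-hot encoded via C()
    "channel": rng.choice([f"ch{i}" for i in range(8)], n),
})
df["power"] = 0.5 * df["spec"] + rng.normal(scale=0.3, size=n)

# Spectrographic-only (s1) vs combined (s1p1) models, channel as random intercept.
s1 = smf.mixedlm("power ~ spec", df, groups=df["channel"]).fit(reml=False)
s1p1 = smf.mixedlm("power ~ spec + C(phoneme)", df, groups=df["channel"]).fit(reml=False)
print(s1.aic, s1p1.aic)  # lower AIC indicates the better-fitting model
```

The real analysis also included the shuffled-feature control models; they would be fit in the same way on permuted copies of the feature columns.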
Broadband LFP as well as power in the delta, theta, alpha, and beta bands were best fit by linear mixed effects models that included both spectrographic features and phonemic labels (s1p1). For these bands, the s1p1 model was the best fit for all participants. These results suggest that power in these bands is driven in part by phonemic category information that is not reducible to speech acoustics. For power in the gamma and high-gamma bands, however, the best fit model varied across individuals. For gamma power, eight participants' data were best fit by the model that included only spectrographic features (s1), and two participants' data were best fit by the s1p1 model that included both spectrographic and phonemic label features (SD013, SD018). Similarly, for high-gamma power, eight participants' data were best fit by the s1 model that included only spectrographic features, and the remaining participants' data were best fit by the s1p1 model that included both feature sets (SD011, SD013). These results suggest that power at frequencies above 30 Hz is primarily driven by speech acoustics rather than phonemic category information.

Supplementary Figure 11. The LME approach models neural activity as a combination of spectrographic and phonemic label features. For each stimulus waveform, spectrograms are computed and time-aligned phoneme-level transcriptions are assigned (top row). Spectrograms are compressed into 128-dimensional vectors using a generative adversarial autoencoder network, and transcriptions are one-hot encoded. Three classes of model are created from these features: a purely spectrographic model (left column), a purely phonemic label model (right column), and a model containing both feature sets (middle column). For each band of neural activity, mixed effects models are fit. Model weights are used to reconstruct a predicted response for each band, and the correlation between the predicted and recorded neural response is calculated.
For each subject and response type, the model with the second lowest AIC score was >100 times as likely as the third ranked model. Models containing only phonemic label features were most likely to be the second ranked model for the lower frequency bands. However, with increasing band frequency, the s1 model becomes more likely to provide a better fit for the data until, for the gamma and high-gamma bands, it is the best fit model overall. These results support the generalization that phonemic labels better explain power at lower frequencies, while acoustic features better explain power at higher frequencies.
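The likelihood statements above follow the standard AIC relative-likelihood rule: model i is exp((AICmin − AICi)/2) times as likely as the best model, and normalizing these quantities yields the model weights. A minimal sketch with made-up AIC scores ("p1" is my label for the phonemic-label-only model, by analogy with s1 and s1p1):

```python
import numpy as np

# Hypothetical AIC scores for three candidate models.
aic = {"s1": 10450.0, "p1": 10630.0, "s1p1": 10210.0}

def akaike_weights(aic_scores):
    """Normalize the relative likelihoods exp(-deltaAIC/2) into model weights."""
    vals = np.array(list(aic_scores.values()))
    delta = vals - vals.min()
    rel = np.exp(-delta / 2.0)
    return dict(zip(aic_scores, rel / rel.sum()))

w = akaike_weights(aic)
# With an AIC gap >200, the best model (here s1p1) carries essentially all the weight.
```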

Supplementary Figure 3.
Degree of overlap across bands for channels identified as significant for each of the three (morpho)phonological comparisons. For each subfigure, lower dot matrix plots indicate the combination of bands being considered, and upper bar plots show the number of channels shared in common for that combination of bands. Combinations of bands not indicated in the dot matrix plots shared no significant sites in common. a Number of sites identified as acoustic (pink), phonemic (blue), or both (red) by the coronal tap alternation. b Number of sites identified as surface (green), morphological (orange), or both (dark green) by the regular past tense alternation. c Number of sites identified as surface (purple), morphological (gold), or both (dark purple) by the regular plural alternation.

Supplementary Figure 4.
Distance in feature changes between phones does not structure the null distribution. Number of significant sites observed for random pairs differing in one phonological feature12 (red) and in ten phonological features (blue) for each neural response band, relative to the remainder of the generated null distribution (gray). Vertical axes indicate the number of sites selective for surface identity observed for each comparison, while horizontal axes indicate the number of sites selective for underlying identity observed for each comparison. Dashed gold lines delimit the boundary containing 95% of the null distribution.

Supplementary Figure 5.
Consistency in phonological environment does not structure the null distribution. Number of significant sites observed for 25 random pairs restricted to word-initial position (red) and word-final position (blue) for each neural response band, relative to the remainder of the generated null distribution (gray). Values have been jittered by <1 unit so that all unique values are visible. Vertical axes indicate the number of sites selective for surface identity observed for each comparison, while horizontal axes indicate the number of sites selective for underlying identity observed for each comparison. Dashed gold lines delimit the boundary containing 95% of the null distribution.