Disentangling sound from syntax: electrophysiological analysis of linguistic expressions

Syntax is a species-specific component of human language combining a finite set of words into a potentially infinite number of sentences. Since words are by definition expressed by sound, factoring out syntactic information is normally impossible. Here, we circumvented this problem in a novel way by designing phrases with exactly the same acoustic content but different syntactic structures depending on the other words they occur with. By performing stereo-electroencephalographic (SEEG) recordings in epileptic patients, we measured a different electrophysiological correlate of verb phrases vs. noun phrases by analyzing high gamma band activity (150-300 Hz) in multiple cortical areas in both hemispheres, including language areas and their homologues in the non-dominant hemisphere. Our findings contribute to the ultimate goal of a complete neural decoding of linguistic structures from the brain.


Introduction
Human language is a complex system evolved to store, elaborate and communicate information among individuals. Traditionally, it is analyzed as constituted by three major domains: the physical support which is necessary for communication (the acoustic level), the archive of words isolating concepts and logical operators (the lexicon), and a set of rules combining words into larger units (syntax). Meaning is computed by interpreting syntactic structures, but it is not strictly necessary to generate well-formed linguistic expressions, given the possibility to construct meaningless structures such as this triangle is a circle 1,2. The role of syntax in this complex system is crucial for at least three distinct empirical and theoretical reasons: first, syntax can generate new meaning by permuting the same set of words (so, for example, Abel killed Cain is different from Cain killed Abel); second, there is no upper limit to the number of words that can enter the syntactic composition: syntax can potentially generate an infinite set of structures; third, it appears to be the real species-specific boundary distinguishing human language from that of all other animals 3. Unfortunately, given this integrated and complex design characterizing language, isolating electrophysiological information solely related to syntax seems to be impossible by definition, since sound is inevitably intertwined with syntactic information 4,5 even during inner speech 6: in fact, sound representation is already associated with the words in the lexicon before they enter the syntactic computation.
Current research has provided three major advancements in the comprehension of syntax: first, a preliminary distinction between single words in isolation, basically nouns and verbs 7; second, the demonstration that the severely restricted formal properties of syntax "are not arbitrary cultural conventions" - to put it in Lenneberg's seminal perspective - but rather the expression of the morphological and functional architecture of the brain [8][9][10]; third, the finding that combining an increasing number of words in sequences correlates with increasing electrophysiological activity 11. However, the specific electrophysiological correlates of the syntactic operation as related to specific and different types of words remain completely unknown. We still lack a distinction between basic syntactic structures, such as the correlate of merging an article with a noun to yield a Noun Phrase (NP) or an object with a verb to yield a Verb Phrase (VP).
Here, we addressed this issue by designing a novel protocol to measure the specific electrophysiological correlates of two basic and core syntactic structures while factoring out sound representation.
The stimuli were pairs of different sentences containing strings of two words with exactly the same acoustic information but completely different syntax (homophonous strings). More specifically, each pair contained an NP, resulting from the syntactic combination of two lexical elements (a definite article and a noun), and a VP, resulting from the syntactic combination of two different types of lexical elements (a verb and a pronominal complement): the NP and the VP were pronounced in exactly the same way. In addition, each VP included a further crucial difference: the object of the verb, realized as a pronoun, was moved from its canonical position on the right of the verb to the left of the verb, a syntactic operation called "cliticization". This novel strategy was made possible by relying on the Italian language. For example, the sequence [la pɔrta] could be interpreted either as a noun phrase ("the door") or as a verb phrase ("brings her"; lit.: her brings), depending on the syntactic context of the sentence in which it was pronounced (Figure 1). As for the acoustic information in the homophonous phrases, it must be noted that within each pair of sentences containing the same homophonous phrase, either phrase was deleted and substituted with a copy of the other one: this strategy ruled out the possibility that the structure of the two phrases could be distinguished by subtle intonational or prosodic cues; in practice, the relevant part of the stimuli constituting the homophonous phrase was physically identical. Although these results were based on a peculiar property of the Italian language, they are generalizable to other languages because the basic distinction of nouns vs. verbs is universally attested across languages 12. As for other variables constituting the homophonous phrases, words were balanced for major semantic features (such as abstract vs. concrete) and for length (number of syllables).

Results
The two sentence types were first differentiated by their level of "surprisal", an information-theoretic measure of the expectedness of each word given its preceding context, defined as the negative log probability of a word in a sentence given the words that precede it 13. The analysis shows that whereas there is no significant surprisal difference for the Verb/Noun position in the phrase, the values for the article/clitic position were significantly different (Figure 2). In fact, the more complex syntactic structure, i.e. the VP involving movement of the object from the right to the left of the verb, resulted in a higher surprisal level when the same auditory input was interpreted as a clitic rather than an article, as indicated by classical statistics and by decoding in the feature space with a Support Vector Machine (SVM) analysis. Table S1 reports the number of valid cases, the percentage of missing values, and the mean and standard deviation of the surprisal values, separately for the two experimental conditions. As reported in Table S2, 84% (n = 26) of the sentences with low surprisal were NPs and 84% (n = 26) of the sentences with high surprisal were VPs.
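The surprisal measure can be sketched in a few lines; the following is a minimal illustration using a toy bigram model with made-up conditional probabilities (the study used corpus-derived probabilities, not this toy model):

```python
import math

def surprisal(word, context, cond_prob):
    """Surprisal of `word` given its preceding `context`:
    S = -log2 P(word | context). Higher values = less expected."""
    p = cond_prob[(context, word)]
    return -math.log2(p)

# Toy conditional probabilities (hypothetical values, for illustration only):
# after "la", the article reading ("porta" as noun) is more expected
# than the clitic reading ("porta" as verb) in this toy model.
cond_prob = {
    ("la", "porta_noun"): 0.25,   # P(noun | la) -> low surprisal
    ("la", "porta_verb"): 0.05,   # P(verb | la) -> high surprisal
}

s_noun = surprisal("porta_noun", "la", cond_prob)  # -log2(0.25) = 2.0 bits
s_verb = surprisal("porta_verb", "la", cond_prob)  # -log2(0.05) ≈ 4.32 bits
```

The asymmetry mirrors the reported result: the same acoustic input carries higher surprisal when parsed as a clitic than as an article.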
To fully exploit the potential offered by homophonous strings, we investigated the electrophysiological correlates of these NPs vs. VPs with intracranial electrodes for stereo-electroencephalography (SEEG) monitoring (see Figure S1 for the visualization used in one subject for the assessment of anatomical electrical sources). Invasive intracranial EEG monitoring allows us to precisely localize sources of activation. This procedure offers a unique opportunity to observe human brain activity with a combination of spatial and temporal resolution unattainable with the available non-invasive recording and imaging techniques.
A total of 23 patients undergoing surgical implantation of electrodes for the treatment of refractory epilepsy 14 completed all experimental sessions. Only patients without anatomical alterations, as evident on MR, were included. No seizure occurred, no alterations in the sleep/wake cycle were observed, and no additional pharmacological treatments were applied during the 24 h before the experimental recording. Neurological examination was unremarkable in all cases; in particular, no neuropsychological or language deficits were found in any patient. In all patients, language dominance was assessed with high frequency stimulation (50 Hz, 3 mA, 5 s) during SEEG monitoring. Two patients also underwent an fMRI study during a language task before electrode implantation.
Eight patients were excluded after analysis as they exhibited pathological EEG findings. Five patients were also excluded because no explored recording contact showed a task-related significant activation. Demographic data are shown in Table S3. In the remaining 10 subjects, a total of 164 electrodes were implanted (median 16). The temporal lobe was the most explored brain region, with 26 electrodes in the dominant hemisphere (DH) and 42 in the non-dominant hemisphere (NDH), followed by the frontal lobe (22 electrodes in DH and 21 in NDH).
The central lobe was implanted with a total of 22 electrodes (9 in DH). The parieto-occipital region was studied with a total of 9 electrodes in DH and 21 in NDH.
The contacts that exhibited a significantly different response according to whether the homophonous words belonged to VPs or NPs were considered "responsive contacts" (RCs). An example of an RC is shown in Figure 3.
To validate the processing pipeline, we analyzed the ERPImage and event-related spectral perturbation (ERSP) of contacts responsive to the auditory stimuli (i.e., in Heschl's gyrus) and highlighted clear auditory event-related potentials (ERPs) and power increases time-locked to the stimulus presentations (Figure S2). We also retained in the RC pool only the contacts where the differential response between VPs and NPs was specific to the time region of interest (tROI, the interval spanning from the beginning of the Art/Cl to the end of the Noun/Verb).
Incidentally, the high gamma frequency interval (150-300 Hz) showed the greatest tROI specificity in RCs. As an example, an RC (B13) is compared to a Heschl's gyrus contact in Figure S3. The ERSP analysis indicated that 242 (16.2%) of the leads exploring grey matter exhibited a significant high gamma (150-300 Hz) power increase during the presentation of the VP homophonous phrase with respect to both the baseline and the other words (113 DH, 129 NDH).
The percentage of RCs in the DH was higher than in the NDH (19.3% vs. 15.1%), and the difference was significant (p = 0.044, Fisher's exact test).
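This kind of hemispheric comparison amounts to a standard 2x2 Fisher's exact test. A minimal sketch follows; the counts are illustrative, chosen only to approximate the reported percentages (the exact grey-matter lead totals per hemisphere are not given in the text):

```python
from scipy.stats import fisher_exact

# Hypothetical 2x2 contingency table (illustrative counts):
# rows = hemisphere (DH, NDH),
# columns = (responsive contacts, non-responsive contacts).
table = [[113, 472],   # DH: 113 RCs out of ~585 leads (~19.3%)
         [129, 725]]   # NDH: 129 RCs out of ~854 leads (~15.1%)

odds_ratio, p_value = fisher_exact(table, alternative="two-sided")
print(f"odds ratio = {odds_ratio:.2f}, p = {p_value:.3f}")
```

With the study's true per-hemisphere totals in place of these guesses, the same two-line test reproduces the reported comparison.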
The frontal lobe contained 44 RCs (18.2%; 10 DH; 34 NDH), the majority of which were in the inferior frontal gyrus (13; 29.5%; 3 DH; 10 NDH) and in the frontal part of the cingulate gyrus (20; 45.5%; 2 DH; 18 NDH). A detailed description of the localization of RCs for each patient can be found in Table S4. Figure 4 shows all RCs positioned and template-matched after warping each patient's MRI scan 15.

Discussion
In this manuscript, we defined and exploited a novel protocol to better understand the neural correlates of syntactic structure. In the SEEG signals, we always found higher ERSP for VPs than for NPs. This was true in particular for the verb/noun segment (see Figure 3, second panel, after the dotted line), but it was also true, even if with less evidence, for the article/clitic segment preceding it (see Figure 3, second panel, before the dotted line): this strongly supports the conclusion that the observed difference cannot be reduced to the morphological properties of nouns vs. verbs but rather pertains to the syntactic operations yielding a VP and an NP.
High gamma activity (>100 Hz) is receiving growing interest as a means to understand and characterize inter-regional cortical communication 16. This band is one of the most used indices of cortical activity associated with cognitive function, and has been shown to be correlated with the neuronal spiking rate and with the hemodynamic BOLD response measured with functional magnetic resonance imaging, both in animal models 17 and in the human cortex 18,19. In particular, a large body of studies has indicated its value in tracking cortical activity during language perception and production 20, supporting its use as a safer alternative to cortical stimulation for the presurgical mapping of cortical language areas 21. In the present study, a significant increase in high gamma event-related spectral perturbation 22 (ERSP) was a specific index of exposure to the syntactic contrast between clitic-verb phrases and homophonous article-noun phrases.
This specific impact of syntactic structure on high gamma activity was not limited to the areas traditionally associated with syntactic processing on the basis of lesion effects and functional magnetic resonance evidence 23. It must be underlined that a robust literature indicates that the categorial morphological contrast between nouns and verbs is indeed reflected in terms of lateralization and localization 7.
The different activity observed in our experiment must then be related to a different factor, namely the syntactic structure of the stimuli. In particular, given that the physical stimuli were the same and that we did not observe the typical correlates distinguishing distinct lexical categories, such as nouns and verbs, the higher activity for VPs can reasonably be attributed to the remaining difference, namely the syntactic structure involving the displacement of the object clitic from the right to the left side of the verb. Furthermore, these results suggest that, while syntactic impairment is known to be caused by focal lesions affecting nodal structures in a dedicated network, syntactic processing must involve a much more integrated pattern of brain activity than expected.
Finally, our results concerning syntactic structures converge with parsing as shown by the surprisal analysis.
Syntactic surprisal is related to the expectedness of a given word's syntactic category given its preceding context 24 and is associated with widespread bilateral activity indexed by the BOLD signal 25 . In fact, the position of the object to the left of the verb is reflected in the higher surprisal, showing that this measure is sensitive to syntactic structure.
All in all, the results obtained by comparing homophonous VPs and NPs allow us to factor out sound from the electrophysiological signal and consequently highlight specific syntactic information distinguishing these universal linguistic structures. Notice that this separation could by no means be obtained by analyzing the electrophysiological correlates of inner speech, since it has been shown that acoustic information is represented in higher language areas even when words are simply thought 6. This first step opens the way to a deeper understanding of the structure and nature of human language and contributes to the ultimate, far-reaching goal of a complete neural decoding of linguistic structures from the brain 26.

Materials and Methods
Stimuli
A novel set of stimuli capitalizing on three special characteristics of Italian was devised. First, some definite articles (such as [la], written la; "the fem.sing.") are pronounced exactly like some object clitic pronouns (such as [la], written la; "her fem.sing."): both items are monosyllabic morphemes inflected for gender and number. Second, the syntax of articles and clitic pronouns is very different: as in English, articles precede nouns whereas complements follow verbs but, crucially, object clitics are obligatorily displaced to the left of the verb with finite tenses. Third, the Italian lexicon contains several homophonous verb-noun pairs, such as [pɔrta] (written porta), which can mean either "door" or "brings". Combining these facts, a set of word pairs such as [la pɔrta] (written la porta) was constructed which could be interpreted either as noun phrases ("the door") or as verb phrases ("brings her") depending on the syntactic context (homophonous phrases) they are inserted in. Moreover, to ensure that no phonological or prosodic factors distinguished the two types of phrases, the exact copy of the pronunciation of one phrase replaced the other in either sentence in the acoustic stimuli. No other semantic or lexical distinction differentiated the two types of phrases, which were balanced for major semantic features (such as abstract vs. concrete).
The acoustic stimuli were recorded using a Sennheiser MH40P48 microphone (sound card: Motu Ultralight Mk3; connection: FireWire 400; computer: Apple OSX 10.5.8). The stimuli were edited using Audiodesk 3.02 and mastered using Peak Pro 7. Files were generated at 16-bit, 44.1 kHz sampling frequency; intensity was normalized to 0 dB and rendered in .wav format. All sentences were read by the same person: an Italian native speaker, male, 53 years old.

Surprisal value computation
The value of surprisal (S) indicates how unexpected a given word is on the basis of the preceding words 27. In order to calculate the surprisal value associated with each word of the sentence, we used previously developed algorithms. The analyses showed significant differences between the surprisal value of the article and that of the pronoun (t = -6.794, p < .001), with a higher surprisal value for pronouns than for articles. No significant differences were found between the surprisal values associated with nouns and those associated with verbs (t = 1.357, p = .185).
In order to dichotomize the surprisal variable, we divided the distribution of the surprisal values of articles and clitics on the basis of the median (M = 1.9097) obtained from the occurrence of the lemmas in the ITWAC database. The values were classified, respectively, as high and low surprisal.
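The dichotomization described above is a simple median split. A minimal sketch follows; the sample surprisal values are hypothetical, and only the median (1.9097, from the ITWAC-derived distribution) is taken from the text:

```python
import statistics

def median_split(values, median=None):
    """Label each surprisal value 'low' or 'high' relative to the median.
    If `median` is given (e.g. a corpus-derived value), it overrides
    the sample median of `values`."""
    m = statistics.median(values) if median is None else median
    return ["high" if v > m else "low" for v in values]

# Illustrative surprisal values (hypothetical); the cut point is the
# median reported in the text, derived from the ITWAC corpus.
labels = median_split([0.8, 1.2, 2.5, 3.1], median=1.9097)
# -> ['low', 'low', 'high', 'high']
```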

Patients
A total of 23 patients were recruited for the present study among those who underwent surgical implantation of multi-lead intracerebral electrodes for refractory epilepsy at the "Claudio Munari" Epilepsy Surgery Center of Milan, Italy 30,31. Only patients with negative MRI and with no neurological and/or neuropsychological deficits were included. Based on anatomo-electro-clinical correlations, each patient-specific implantation strategy was defined purely on clinical grounds, in order to define the 3D shape of the epileptogenic zone (EZ).
Demographic data of the 10 patients included in the analysis are reported in Table S3.

The present study received the approval of the Ethics Committee of ASST Grande Ospedale Metropolitano Niguarda (ID 939-2.12.2013) and informed consent was obtained.

Surgical Procedure and Recording Equipment
All trajectories of the patient-specific implantation strategy were planned on 3D multimodal imaging, and the electrodes were stereotactically implanted with robotic assistance. The whole workflow has been detailed elsewhere 32. SEEG electrodes are probes with a diameter of 0.8 mm, comprising 5 to 18 leads, each 2 mm long and spaced 1.5 mm apart. A post-implantation cone-beam CT, obtained with the O-arm scanner (Medtronic, Minneapolis, Minnesota), is subsequently registered to the pre-implantation 3D T1W MR in order to accurately assess the position of every recording lead. Finally, a multimodal scene is assembled with 3D Slicer 33, aimed at providing the epileptologist with interactive images for the best assessment of anatomical electrical sources (Figure S1).
During the experiment, the SEEG was continuously sampled at 1000 Hz (patients 1-12) or 2000 Hz (patients 13-23) by means of a 192-channel SEEG device (EEG-1200 Neurofax, Nihon Kohden). In each patient, all leads from all electrodes were referenced to two contiguous leads in the white matter in which electrical stimulation did not produce any subjective or objective manifestation (neutral reference).

Recording Protocol
Each subject rested in a comfortable armchair. Constant feedback was sought from the patient to ensure the overall comfort of the setup for the whole duration of the experiment. Stimuli were delivered in the auditory modality (see also Figure 1) using the Presentation software (Neurobehavioral Systems). Phrases were delivered via audio amplifiers at a comfortable volume (the minimum volume at which words could be perceived with ease, according to the subject) while the subject gazed at a small cross on a 27-inch screen. A synchronization TTL trigger was sent to the SEEG trigger port at the beginning of each auditory presentation (sentence). Jitter and delays were tested and verified to be negligible (less than 1 ms). The whole experiment lasted around 30 minutes to maximize engagement. At the end of each task, subjects were asked to answer a few short questions on the content of the stimuli. Patients were always able to provide correct answers to the questions, demonstrating their continuous engagement with the task.
A camera, synchronized to the SEEG recording at source, was used to control for excessive blinking, maintenance of fixation with no eye movement, silence and any unexpected behavior from the patients.

Control Experiment
As a further control for the analysis, the first three subjects underwent an extra auditory task. The modalities remained the same, but the sounds were replaced with beeps (auditory presentation) carrying no meaning at all.

Data analysis
A band-pass filter (0.015-500 Hz) applied at hardware level prevented any aliasing effect from altering SEEG data. Recordings were visually inspected by clinicians and scientists in order to ensure the absence of artifacts or any pathological interictal activity. Pathological channels were discarded. Further analyses were carried out using custom routines based on Matlab, Python and the EEGlab toolbox 34 . Data were annotated with the events triggered by the beginning of each stimulus. Events were time locked to the beginning of each word (initial syllable of the word for auditory presentation).
Epochs were extracted in the interval [-1.5, 4.5] s, time-locked to the initial presentation (i.e., the beginning of the phrase). The epoch length was selected so as to always include the complete stimulus presentation (trial).
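The epoch extraction step can be sketched as follows; a minimal NumPy implementation under the assumption that events are given as sample indices into the continuous recording (the study used EEGLAB-based routines, so this is an illustrative re-implementation, not the original code):

```python
import numpy as np

def extract_epochs(data, events, fs, t_min=-1.5, t_max=4.5):
    """Cut fixed-length epochs around each event onset.
    data: (n_channels, n_samples) continuous SEEG;
    events: event onsets in samples; fs: sampling rate in Hz."""
    n0, n1 = int(t_min * fs), int(t_max * fs)
    epochs = []
    for ev in events:
        # Keep only epochs fully contained in the recording.
        if ev + n0 >= 0 and ev + n1 <= data.shape[1]:
            epochs.append(data[:, ev + n0: ev + n1])
    return np.stack(epochs)  # (n_epochs, n_channels, n_times)

# Hypothetical data: 4 channels, 60 s at 1000 Hz, two stimulus onsets.
fs = 1000
data = np.random.randn(4, 60 * fs)
epochs = extract_epochs(data, events=[5000, 20000], fs=fs)
# epochs.shape -> (2, 4, 6000): 6 s per epoch ([-1.5, 4.5] s at 1000 Hz)
```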
Epochs with prominent artifacts (e.g. spikes) over significant channels were rejected. To determine significant responsive sites, analyses were performed both in the time and frequency domains. Epochs were then sorted into two classes based on the surprisal value (low or high).

Analyses in the time domain
In the time domain, single-trial data epochs were color-coded by amplitude to form an ERPImage 2D view 34, without any smoothing over trials (Figure S2).

Analyses in the frequency domain
Time-frequency transforms of each trial were normalized to the baseline (divisive baseline, ranging from -1500 ms to -5 ms, time-locked to the beginning of the sentence) and time-warped to the beginning of the sentence, the beginning of the Art/Cl, and the beginnings of the first and second words after it; they were then averaged across trials to obtain the event-related spectral perturbations (ERSPs), a generalization of ERD/ERS analyses to a wider range of frequencies 35 (Ext. Data Figure 2, Panel B), i.e., from theta (1-4 Hz) to high gamma (150-300 Hz). A bootstrap distribution over the trial baselines was used to determine the significance (p < 0.05) of the time-frequency voxels.
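The divisive baseline normalization (before time-warping and averaging) can be sketched as follows. This is a simplified re-implementation under two stated assumptions: the time-frequency power is already computed as a (trials, frequencies, times) array, and the normalized power is expressed in dB relative to baseline, a common EEGLAB convention; the text specifies only "divisive baseline":

```python
import numpy as np

def divisive_baseline_ersp(tf_power, times, base=(-1.5, -0.005)):
    """Normalize single-trial time-frequency power by the mean baseline
    power per frequency (divisive baseline), then average across trials.
    tf_power: (n_trials, n_freqs, n_times); times: (n_times,) in seconds."""
    mask = (times >= base[0]) & (times <= base[1])
    baseline = tf_power[:, :, mask].mean(axis=2, keepdims=True)
    ersp = 10 * np.log10(tf_power / baseline)   # dB relative to baseline
    return ersp.mean(axis=0)                    # (n_freqs, n_times)

# Hypothetical power array: 30 trials, 50 frequencies, 6 s at 200 samples/s.
times = np.linspace(-1.5, 4.5, 1200)
power = np.abs(np.random.randn(30, 50, 1200)) + 1.0  # strictly positive
ersp = divisive_baseline_ersp(power, times)
# ersp.shape -> (50, 1200)
```

The baseline window [-1.5, -0.005] s matches the -1500 ms to -5 ms interval reported in the text.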
We considered the average ERSP across the gamma (50-150 Hz) and high gamma (150-300 Hz) frequency bands to obtain band-specific ERSPs (bERSPs) and compared them over time between low and high surprisal (Ext. Data Figure 2, Panel A). These bands were selected after a preliminary analysis of data from Heschl's gyrus in the real and control experiments, which highlighted the presence of significant bERSP up to 300 Hz (see Ext. Data Figure 2, Panel B). The preliminary analysis also showed that several contacts exhibited a significant time-specific differentiation in high gamma (150-300 Hz) bERSP between VPs and NPs, and we used that frequency band to highlight responsive contacts (see next paragraph).

Identification of responsive contacts
Each contact (i.e., channel) in each subject underwent a series of screenings to determine its significance. A contact was deemed responsive if either the low or high surprisal high gamma bERSP had significant amplitude specifically in the tROI (the interval spanning from the beginning of the Art/Cl to the end of the Noun/Verb) for a significant time span. The amplitude was deemed significant if and only if it was greater than 95% of the distribution of amplitudes across frequencies for a significant time span. A time span was deemed significant if longer than 95% of the significant intervals in the baseline. The rationale of this test was to exclude contacts that did not reach significance in the time ROI and to ensure specificity both in frequency (i.e., statistically different low and high surprisal high gamma time courses, with only one of them over threshold, or both over threshold but statistically different; p < 0.05) and in time (i.e., no significance when performing the same analysis at other time intervals, such as from the second word after the Art/Cl to the end of the sentence, or from the beginning of the sentence to the beginning of the Art/Cl). Significant contacts were then ranked from high to low according to the score (A_max × T) / Σ_i (A_i × T_i), where A_max is the maximum amplitude over the time ROI, T is the length of the interval within the time ROI during which the amplitude is significant, and A_i and T_i are, respectively, the maximum amplitude and the length of the significant interval at the other positions i in the phrase (i.e., outside the tROI). The rationale of this formula was to identify the contacts showing the maximum time-specific significant difference.
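The ranking score reduces to a one-line function. The sketch below uses hypothetical amplitude and duration values; the variable names follow the symbols in the formula as reconstructed above:

```python
def contact_rank(a_max, t_sig, others):
    """Rank score (A_max * T) / sum_i(A_i * T_i): the maximum significant
    high-gamma amplitude in the tROI times its significant duration,
    divided by the summed amplitude-duration products at the other
    positions in the phrase (outside the tROI)."""
    denom = sum(a * t for a, t in others)
    return (a_max * t_sig) / denom if denom > 0 else float("inf")

# Hypothetical values: a contact strongly and specifically active in the
# tROI (amplitude 3.0 for 0.4 s) with weak activity elsewhere.
score = contact_rank(a_max=3.0, t_sig=0.4, others=[(1.0, 0.1), (0.5, 0.2)])
# (3.0 * 0.4) / (1.0*0.1 + 0.5*0.2) = 1.2 / 0.2 = 6.0
```

Higher scores correspond to contacts whose significant activity is concentrated in the tROI rather than spread over the rest of the phrase.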
An inspection of all the contacts was also visually performed by expert clinicians and results were compared to the data-driven analysis in a double-blind fashion. The concordance was 84%. This analysis provided both validation to the data-driven analysis and also provided an extra control that selected responsive contacts (i) were not located in the white matter, (ii) were not located in affected regions of the brain, (iii) exhibited similar behaviour (e.g., high gamma time course waveform shape) if anatomically close and referring to the same brain region.

Decoding
Decoding of the phrase type (noun vs. verb phrases) was first performed based on the surprisal values of the Art/Cl and Verb/Noun parts of the phrases (Figure 2). After testing for normality (Kolmogorov-Smirnov), VP and NP surprisal values were also statistically compared (one-way ANOVA). Decoding of the two classes was also performed on the feature space formed, for each trial, by the Art/Cl surprisal value and the power amplitude in the time ROI (Figure 3). In both cases, a Support Vector Machine (SVM) algorithm with leave-one-out cross-validation (LOOCV) was used to ensure the generalizability of the model.
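The SVM/LOOCV decoding scheme can be sketched with scikit-learn. The feature values below are synthetic (drawn from made-up Gaussians); only the feature-space structure - one 2-D point per trial, combining the Art/Cl surprisal and the tROI power amplitude - follows the description above:

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.svm import SVC

# Hypothetical 2-D feature space, one point per trial:
# [Art/Cl surprisal, high-gamma power in the tROI]; labels 0 = NP, 1 = VP.
rng = np.random.default_rng(0)
X_np = rng.normal([1.0, 0.5], 0.3, size=(26, 2))   # NPs: low surprisal/power
X_vp = rng.normal([2.8, 1.5], 0.3, size=(26, 2))   # VPs: high surprisal/power
X = np.vstack([X_np, X_vp])
y = np.array([0] * 26 + [1] * 26)

# Leave-one-out cross-validation: train on n-1 trials, test on the held-out
# trial, repeat for every trial, and average the per-trial accuracies.
scores = cross_val_score(SVC(kernel="linear"), X, y, cv=LeaveOneOut())
print(f"LOOCV accuracy: {scores.mean():.2f}")
```

LOOCV is a natural choice here because each patient contributes a limited number of trials; every trial serves once as the test set, so no data is wasted on a fixed hold-out split.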

Declaration of interests
The authors do not have anything to disclose.

Figure 1. In the first sequence, [laˈpɔrta] (written here as la porta) is a Noun Phrase: the article la (the) precedes the noun porta (door). In the second sequence, instead, the very same sequence is a Verb Phrase: the object clitic pronoun la (her) precedes the verb porta (brings), which governs it. The difference is not only reflected in the distinct lexical classes; there is also a major syntactic difference: in the noun phrase, the element preceding the noun, namely the article, is base-generated in that position; in the verb phrase, instead, the element preceding the verb in the acoustic stimulus, namely the clitic pronoun, is base-generated on the right of the verb, occupying the canonical position of complements, and then displaced to a preverbal position. This fundamental syntactic difference is represented in the syntactic trees in the picture: "t" indicates the position where the pronoun is base-generated in the VP. Notably, to exclude phonological or prosodic factors which might distinguish the two types of phrases, in our experiment the exact copy of the pronunciation of one phrase replaced the other in either sentence in the acoustic stimuli. In other words, subjects heard the very same acoustic stimulus for each homophonous phrase.

Figure 3. A) The four vertical bars respectively indicate the beginning of the phrase, the beginning of the Art/Cl, the beginning of the first word immediately following the Art/Cl (Verb/Noun), and the beginning of the word after. A1 and A2 refer to the auditory stimuli respectively before and after the tROI (Art/Cl + N/V syntactic construct). B) Time-warped Event-Related Spectral Perturbation (ERSP), respectively for VPs and NPs. The four vertical bars and indicators have the same meaning as in the baseline-normalized power plots (panel A). C) From left to right, the bars represent (i) the absolute value of the normalized power difference between VPs and NPs in the intervals A1 (orange) and S (green), and (ii) the absolute value of the normalized power of VPs (pink) and NPs (blue) in the intervals A1 and S, respectively. *** = p < 0.001