Introduction

Transgender individuals describe the incongruence between their assigned sex at birth and their own gender identity to be a significant source of distress1,2,3. Compared to cisgender individuals, trans individuals have higher rates of suicide and suicide attempts4, distress5,6, depression, and anxiety7 and are more likely to be the victims of harassment and violence8. As a result, many seek gender confirmation surgeries or testosterone (T) therapy to bring their physical appearance and/or speech into alignment with their experienced gender. These interventions are generally effective: recipients report greater external validation of their gender from social engagements following treatment9 as well as overall improvements in quality of life10,11,12. For transgender men (referred to as transmen throughout) undergoing T therapy, more masculine speech is correlated with greater reported well-being2.

Two key acoustic characteristics of speech independently contribute to a masculine-sounding voice13,14,15,16: 1) fundamental frequency (fo) of vocal fold vibration, relating to voice pitch, and 2) the spectral structure of speech formants, which give identity to vowels and thus speech content, but also reflect vocal tract length (VTL)17. During puberty in cisgender males (whose natal sex and identity are both male; called “cismen” throughout), both the length and thickness of the vocal folds increase alongside T levels, thereby lowering fo18,19,20,21,22,23,24. The relationship between T and fo remains consistent following puberty such that adult cismen with lower fo also have higher salivary T25,26,27,28 (but not always29). Compared to ciswomen, cismen’s vocal folds are approximately 60% longer30 and speaking fo is, on average, 80 Hz lower31. In addition to vocal fold changes, the larynx also descends in cismen during puberty, resulting in a larger and longer vocal tract. This 10–20% difference in VTL between cismen and ciswomen accounts for overall lower formant frequencies in cismen that occupy a smaller total frequency range17,24,28,31.

In previous studies of T therapy-related changes in the voices of transmen, fo is reliably reduced in most participants after sufficient time on T therapy. Longitudinal studies demonstrate significant (e.g., ~ 60–70 Hz) and stable within-participant reductions in fo after a minimum of ~ 3–4 months on T therapy32,33,34,35. Accordingly, participants typically perceived their own voices to be lower in pitch32 and participants with lower fo reported a higher likelihood of being perceived as male over the phone35. Although T therapy often plays a significant role in female-to-male transition—noted by fo values that are largely indistinguishable from those of cismen for many speakers36—not all individuals experience enough fo lowering to reach the typical range of cismen. A meta-analysis found that 21% of patients fail to reach the fo range of cismen after one year on T therapy37, by which time voice fo lowering has typically reached an asymptote regardless of dose regimen38. Not surprisingly, an estimated 12–16% of patients are not fully satisfied with their vocal transition32,37.

Fundamental frequency (fo) is often reported to be the most heavily weighted cue for listeners in determining speaker gender identity39,40. However, studies of experimentally manipulated speech, in which fo and formant frequencies are varied independently, reveal that listeners rely on more than just fo when making judgments of speaker gender. For example, listeners can correctly identify ciswomen speakers as women even when their fo values are artificially lowered to the range of cismen41. When fo and formant frequencies are altered to the same perceptual degree (based on empirically derived just-noticeable-differences), listeners appear to rely more on formant information when making judgments about speaker body size, masculinity, and physical dominance28,42 (cf. Hillenbrand & Clark15). Thus, speech features other than fo may impact the externally perceived gender identity of individuals undergoing the female-to-male transition, and these additional features, such as formant frequencies or the related measure of VTL, can potentially be used as outcome measures to assess efficacy of transition strategies and increase patient satisfaction. Yet, aside from one case study43 and an unpublished dissertation44, no studies on transgender men have investigated changes in formant measures with T therapy.

The present research has four central research questions. First, how important is masculinization of speech parameters relative to other traits for those undergoing a female-to-male transition? Second, is T therapy effective at masculinizing acoustic properties of speech that drive gender perception: fo (mean and variation, fo-SD) and formant-based estimates of VTL? Research suggests that both are critical to perception of gender; however, little research exists on formant changes in transmen. Third, do higher salivary T levels and a longer duration of T therapy contribute to more masculine speech parameters? Finally, how do transmen rate their satisfaction with speech changes compared to other changes that occurred during T therapy?

Results

How important is masculinization of speech parameters for transmen undergoing T therapy?

Voice masculinity was rated by participants as one of the traits they were least satisfied with prior to transition compared with all other traits (See Fig. 1A); 77% of participants rated voice masculinity as a “1” or a “2” on the 1-to-7 scale (1 indicated that they were extremely unhappy with the trait). Furthermore, when asked to rank the importance of observing change relative to other traits, change in voice masculinity was ranked as most important (see Fig. 1B); 83% of participants ranked it as a “6” or “7” (7 indicated that it was very important to see changes).

Is T therapy effective at masculinizing acoustic properties of speech that drive gender perception?

Results suggest that T therapy is effective at masculinizing transmen’s fo and fo-SD. A univariate ANOVA showed a significant effect of group on fo [F(2, 93) = 164.8, p < 0.001, η2 = 0.78], fo -SD [F(2, 93) = 45.4, p < 0.001, η2 = 0.49], and VTL [F(2, 92) = 47.5, p < 0.001, η2 = 0.51]. Post-hoc comparisons using Tukey’s HSD showed that the fo and fo–SD of transmen were significantly lower than the fo and fo–SD of ciswomen (Cohen’s d = 3.5 and 1.5, respectively), while indistinguishable from those of cismen. None of the transmen fell outside the cismen range for fo (see Table 1); however, seven transmen (23%) had greater fo–SD than the highest value in the cismen group. Transmen’s estimated VTL was significantly longer than ciswomen (Cohen’s d = 1.0), but shorter than cismen (Cohen’s d = 1.4). Further, 23% of the VTL estimates were smaller in transmen than the lowest value in our cismen sample. See Table 1 for the mean, standard deviation, and range of speech parameters for transmen, cismen, and ciswomen groups. See Fig. 2 for visual comparisons of the three groups.

Do higher salivary T levels and a longer duration of T therapy contribute to more masculine speech parameters?

Two individuals with very high T levels (2,889 and 794 pg/mL) were identified using Grubbs outlier test45 and excluded from the following analyses. Salivary T was significantly correlated with lower mean fo (r = -0.37, p = 0.05), but not fo -SD (r = -0.14, ns) or VTL (r = 0.03, ns). See Fig. 3. Testosterone levels were significantly correlated with time on T therapy; individuals who have been on therapy for a longer duration had higher T levels (r = 0.39, p < 0.05). See Fig. 4. Therefore, multiple regression models were then used to examine the independent contributions of circulating T and time on T therapy to each masculine speech parameter. Salivary T significantly predicted lower fo (ß = -0.46, SE = 14.39, t = -2.23, p = 0.04), but T therapy duration did not (ß = 0.19, SE = 5.48, t = 0.90, p = 0.35). As a second step, we added the age at beginning of T therapy to the model. Only salivary T was associated with lower fo; however, it did not reach conventional significance levels (ß = -0.38, SE = 14.53, t = -1.83, p = 0.08). In a separate model predicting VTL, T therapy duration was a significant predictor (ß = 0.46, SE = 0.25, t = 2.30, p = 0.03), while salivary T (ß = − 0.09, SE = 0.67, t = -0.41, p = 0.69) and age at beginning T therapy were not (ß = 0.28, SE = 0.03, t = 1.49, p = 0.15). The model predicting fo-SD was not significant.

How do transmen rate their satisfaction with self-perceived speech changes compared to other changes that occurred during T therapy?

When asked to rate perceived amount of change following T therapy, voice masculinity was rated as the most changed among the surveyed traits with 72% of participants indicating that voice masculinity was “6” or a “7” on the 1-to-7 scale (7 indicated an enormous amount of change observed in the trait). When asked to rate satisfaction with perceived changes, voice masculinity was rated the highest among all survey traits with 77% indicating a “1” or a “2” (1 indicated that they were extremely satisfied with the changes). See Fig. 5.

Discussion

Voice masculinization is particularly important to transgender individuals undergoing a female-to-male transition; compared with eight other masculinity traits, participants indicated that they were least satisfied with their voice prior to transition and ranked it highest in priority for seeing change. Further, after T therapy (which was effective at masculinizing fo and fo -SD in our participants), transmen were most satisfied with their vocal masculinization compared with other traits. Given its importance in a female-to-male transition and the growing number of individuals undertaking this treatment46,47, the need for evidence-based research on voice masculinization is high.

Our results show that, on average, T therapy is effective at masculinizing fo and fo-SD. Transmen’s fo values (mean and range) are comparable to those of cismen and statistically significantly lower than ciswomen’s fo values. While we do not have recordings of these men prior to T therapy, we can assume that their fo was close to the average for ciswomen and that their fo has since changed by 3.5 standard deviations (or more28), which is nearly 80 Hz. Research suggests that 50% of listeners can detect shifts as low as 1.2 semitones (e.g., 7 Hz for a 100 Hz voice42); therefore, these changes likely have a strong impact on perception of gender. Overall, the current findings are consistent with previous results documenting substantial changes in fo with T therapy in transmen32,33,34,35,36 and are suggestive of putative anatomical changes resulting from the action of T on the lengthening and thickening of vocal folds, similar to those occurring during puberty in natal males19,20,21,22,23,24. To understand the nature of these structural changes, future studies should use imaging techniques to objectively quantify vocal fold length and thickness at regular intervals during T therapy.

T therapy may not be sufficient for achieving formant frequencies that are indistinguishable from cismen. Our results showed that transmen’s estimated VTL was significantly longer than ciswomen but shorter than cismen. 23% of our participants’ VTL fell outside the range of our cismen sample, suggesting that T therapy alone does not fully masculinize larynx position. Despite research indicating that both fo and VTL contribute to gendered voice perception13,14,15,16, only one other published study on transmen’s voice changes has examined VTL or formants43. This motivates development of additional treatments, such as behavioral therapy, to increase objective speech masculinity by increasing vocal tract length48. Previous studies on transmen’s speech changes have shown that most changes have occurred prior to 9 months of continuous T therapy32,33,34,35,49; however, these studies did not examine changes in estimated VTL. This study is the first, to our knowledge, to demonstrate statistical differences in VTL between samples of transmen and cisgender speakers [see Cler et al.43 for a single, detailed case study and Papp44 for an unpublished dissertation].

Incomplete masculinization of VTL (as well as fo–SD) may partly explain why 17% of our participants reported that they were ‘neutral’ to ‘extremely dissatisfied’ with changes in their vocal masculinity. This accords with previous studies showing 12–16% of patients are not fully satisfied with their vocal transition and 25% were still sometimes perceived as female on the phone32,37. Further, 31% expressed interest in further masculinizing their speech through additional treatments like behavioral voice therapy32. Despite the need for behavioral voice therapy among transmen, only one published study has examined its efficacy48. This is in contrast to transfeminine individuals where it has been shown to help individuals express their gender identity through speech, reduce gender dysphoria, and improve mental health and quality of life50,51,52,53.

Some of the acoustic properties of speech that drive gender perception were associated with features of T therapy. We found a significant inverse association between salivary T and fo; however, the results appear largely driven by 3 data points (see Fig. 3). Given the small sample size, it is unclear whether these individuals represent the normal range of variation. The association between current salivary T and fo makes theoretical sense given the longer-term association between T administration and fo change in transmen, the presence of androgen receptors on the vocal folds54,55, and the associations between T and fo during puberty18,19,20,21,22,23. In addition, several studies have found links between salivary T and masculine vocal parameters in cismen25,26,28 (cf. Arnocky et al.29), and one study of cismen showed within-individual diurnal decreases in salivary T were associated with increases in fo27. Given the strong empirical and theoretical support for an association between T and fo, it is surprising that two previous studies on female-to-male speech changes35,36 did not find an association between serum T levels and fo.

Although there was not a significant association between salivary T and estimated VTL, T therapy duration was statistically significantly associated with VTL: longer T therapy durations were associated with longer estimated VTLs. This finding may suggest a longer-term relationship between T therapy and VTL. However, an alternative explanation is that this association reflects the confounding effect of time since transition given its close association with duration of T therapy. That is, even without formal voice training, transmen may be implicitly learning how to manipulate their vocal tract over time to achieve longer VTLs. Clinical studies on the relationships among dosing regimen, biological T availability, and speech parameters among transmen are necessary.

In summary, we see two important implications of these findings. First, a voice with a low pitch is a central aspect of masculine gender presentation because it is easily observable, highly sexually dimorphic, and difficult to approximate if not an adult male. Vocal fold dimorphism is one of the largest anatomical sex differences observed in humans (approximately 5 standard deviations28,56) and greater than any other extant ape57. Cisgender men and women differ by 60% in vocal fold length30 but only 8% in height58. Because vocal sexual dimorphism is extensive and, importantly, features little overlap in gender-typical vocal ranges, it is extremely difficult to speak in a voice consistent with the opposite sex, particularly in a sustained fashion31. These facts help explain why transmen are so dissatisfied with their fo prior to T therapy. Similarly, our participants were also highly dissatisfied with body fat distribution, which is also very dimorphic58,59, easily observable, and difficult to change without hormonal therapy.

A second implication of these findings is that more research on speech changes in transgender males is necessary. The studies that have been published are limited by small sample sizes32,33,34,49, a lack of a control group for comparisons32,33,35, and a focus on only fo34,49. Additional research on T dosing regimens as well as the efficacy of behavioral voice therapy are particularly necessary. Better evidence-based treatments for transmen have health and safety repercussions. Transgender individuals are disproportionately targets of violence and being viewed as one’s gender is likely a critical component for safety60,61,62,63. Approximately 20–47% of transgender individuals have been physically or sexually assaulted and an additional 34–46% have been verbally threatened or harassed62,64.

In contrast to voice masculinization, participants did not place high importance on seeing an effect of T therapy on the non-physical trait “psychological masculinity”, highlighting participants’ dissociation between their own perception of gender and outward display of gender prior to therapy65. This incongruence is a source of extreme distress, which is associated with higher levels of depression, anxiety, substance abuse, and suicidal ideation and attempts among the transgender population—particularly those that have not begun to transition5,7,9,65,66,67,68. Receiving hormone treatment significantly improves mental health, social health, and physical health outcomes in transgender populations2,7,9,66. Vocal congruence contributes to these improvements; Watt et al.2 showed that more masculine voices significantly contributed to improved well-being and mental health in female-to-male transgender patients.

To summarize, this research was designed with several goals in mind. First, we aimed to quantify the importance of voice change—relative to other masculine traits—for transgender men undergoing testosterone therapy. No previous studies have explored this question, in spite of the strong interest in voice change among the transmasculine population. Our results show that voice masculinization is of central importance to transgender individuals undergoing the female-to-male transition compared with eight other masculine traits. Second, we asked whether T therapy was effective at masculinizing three gendered speech parameters. Our results show that, on average, T therapy is effective at masculinizing fundamental frequency mean and variation (fo and fo-SD); however, transmen’s formant-based measure of vocal tract length (VTL) was significantly shorter than cismen. This study is the first, to our knowledge, to demonstrate statistical differences in VTL between samples of transmen and cisgender speakers. Third, we examined the association between salivary testosterone and vocal parameters. We found a significant inverse association between salivary T and fundamental frequency but no association with VTL. T therapy duration, however, was statistically significantly associated with VTL. These findings point to the need for more research on speech changes in transgender males—of particular importance are transition strategies that affect formant frequencies, which have largely been ignored in previous research.

Method

Participants

Transgender men

Participants included 30 individuals who were assigned female at birth; four of the participants identified their gender as non-binary and 26 identified as men. All were currently undergoing hormone replacement therapy with T for at least 9 months (Mmonths = 41.50, SD = 25.45) for the purpose of masculinization. Administration routes varied (intramuscular = 13, every 7–14 days; subcutaneous = 11, every 5–14 days; subdermal pellets = 4, every 3–4 months; transdermal = 1, daily). Ages ranged from 20–40 years old (mean: 25.9yrs ± 5.2) and the age when T therapy began ranged from 17 to 38 years old (mean: 22.4yrs ± 5.2); these were closely correlated with each other (r = 0.92, p < 0.001), but neither correlated significantly with time on T therapy or salivary T. Participants reported being primarily attracted to men (N = 6), women (N = 7), or both men and women (N = 15). Two participants declined to answer. Research suggests that lesbian women have lower fo and fo-SD compared to heterosexual women (van Borsel et al., 2017); however, our sample size prohibited analysis by sexual orientation. Participants were excluded if they had been diagnosed with any vocal pathology or if they were habitual smokers in order to minimize the effect of potential spurious relationships. The ethnic composition of the sample was: Caucasian (86.7%), Asian (3.3%), Latin American (3.3%) and multiple ethnicities (3.3%). Recruitment was conducted via digital flyers and notifications on private or closed LGBTQ + or transgender Facebook groups, physical flyers posted at the 2017 Boston Pride festival, and word of mouth.

Cisgender men

Voice samples were collected from 122 cisgender male individuals. From this sample, we selected all individuals aged 21 and older to bring the mean age closer to that of the trans sample. The final N for cisgender men was 34. Ages ranged from 21 to 28 years old (mean: 23.0 years ± 2.2). More detail on this sample can be found in Arnocky et al.29.

Cisgender women

Participants included 32 undergraduate females from a larger study on immune function and phenotypic characteristics. From the larger data set, we selected all individuals 21 and older. Ages ranged from 21 to 32 years old (mean: 22.6 years ± 2.8). The sample was comprised primarily of Caucasian (94%) women.

Procedure

All procedures were approved by the Boston University Institutional Review Board or the Nipissing University Research Ethics Board and were performed in accordance with relevant guidelines and regulations. Informed consent was obtained from all participants.

Questionnaire

Transgender participants answered a survey covering the following topics (7-point Likert scales, 4 = neutral): satisfaction with nine physical and/or psychological characteristics (see below) prior to transition (1 = I was very unhappy; 7 = I was very happy), importance of seeing changes in those characteristics during transition (1 = Changes were not important to me; 7 = Changes were very important to me), amount of observed change in those characteristics since starting T therapy (1 = No change at all; 7 = Enormous amount of change), satisfaction with this change since starting T therapy (1 = I am very unhappy; 7 = I am very happy), and overall satisfaction with their medical transition (1 = I am very unhappy; 7 = I am very happy). The surveyed characteristics were voice masculinity, facial masculinity (excluding facial hair), body hair, amount of muscle mass, physical strength, psychological masculinity, self-identity, sex drive, and body fat distribution.

Voice Recording

All speakers were asked to recite, in their normal speaking voice, the first sentence of the Rainbow Passage, numbers 1 to 10, and vowel sounds in the order /ε/ /i/ /ɒ/ /o͡ʊ/ /u/. Cis- and transmen’s voice samples were collected in a quiet room using an Audio-Technica AT4041 Cardioid Condenser Microphone connected to a Focusrite Scarlett 2i2 audio interface.

For the ciswomen sample, participants were asked to recite the same vowel sounds in the identical order described above. Voice samples were recorded in a sound attenuated room using a Neewer NW-700 condenser microphone with a 48 V phantom power supply and a pop filter. Participant voices were recorded using Goldwave version 6.31 software in mono with a sampling rate of 44.1 kHz and 16-bit quantization. The voice recordings were saved as high quality uncompressed wav files.

For all samples, recordings were analyzed using Praat version 5.369 for mean fo and standard deviation in fo across the utterance (fo-SD) using the ‘voice report’ function. Pitch floor was set to 75 Hz and pitch ceiling was set to 300 Hz for cismen and transmen, and 100 Hz and 500 Hz for ciswomen, which are the recommended parameters for men’s and women’s voices, respectively. Otherwise, default settings were used. We used the Rainbow Passage for analyses involving fo and fo -SD; however, between-group comparisons (i.e., transmen vs. cismen and transmen vs. ciswomen) were similar when either vowels or counting were used for analyses. We also computed fo-CV (fo-SD / mean fo) following Pisanski et al.70 Transmen were not significantly different from either ciswomen or cismen in fo-CV; therefore, we only report fo-SD below.

In order to estimate VTL, formants were calculated using the acoustic recordings of the vowel sounds /ε/ and /ɒ/. All other vowels were omitted from VTL estimates due to narrow constrictions of the vocal tract (/i/) or lip rounding (/o͡ʊ/ and /u/) that complicate the relationship between VTL and formant frequencies71. Praat was used to generate a wide-band spectrogram of the acoustic signal, which was then used to calculate the first four formants using the standard formant tracking software in Praat. For each vowel, these automated formants were visually inspected, and the settings of the tracking software were adjusted until the formants aligned with the spectral representation of the signal. Formant values were calculated over the central stable part of the vowel. Third (F3) and fourth (F4) formant values for each participant were calculated by averaging the values from both vowels. These formant values were then used to calculate VTL estimates via Eq. (1)72, which shows an inverse linear relationship between formants and VTL that is derived from modeling the vocal tract as a uniform tube that is closed at one end (i.e., the vocal folds) and open at the other (i.e., the mouth). In Eq. (1), n is the formant number, Fn is the formant frequency (in Hz), and c is the speed of sound. Finally, VTL estimates from F3 and F4 are averaged for a single VTL estimate for each participant. F3 and F4 are used, because higher formants tend to be more stable and a better estimate of VTL73.

$${\text{VTL}} = \frac{{\left( {2{\text{n}} - 1} \right) \times {\text{c}}}}{{4 \times {\text{F}}_{\text{n}} }}$$
(1)

Saliva collection and analysis

Participants collected saliva samples in a 2-mL cryovial immediately upon waking the morning after they visited the lab to provide questionnaires and voice samples. Mean sample provision start time was 8:44 AM ± 1 h 59 min. Samples were refrigerated immediately after collection and then brought to the research team the following day where they were inspected using the Blood Contamination in Saliva Scale74. They were then stored in a -80 °C freezer until they were shipped overnight on dry ice for analysis. Samples were assayed in duplicate for free T using commercially available enzyme-linked immunoassay kits (DRG International, NJ, USA); average intra- and inter-assay coefficients of variation were 6.84% and 8.97%, respectively. For further detail on collection and assay of saliva samples, see Arnocky et al.29. T means and variance were significantly higher in transmen (M = 321.3 pg/mL, SD = 509, MIN = 47.7, MAX = 2889); however, when two outliers were removed (T = 784 and 2889 pg/mL), the trans mean (M = 212.8 pg/mL, SD = 116.8, MIN = 47.7, MAX = 538.2) was closer to the expected physiological range, compared with published averages for cismen (M = 152.1 pg/mL, SD = 60.5, MIN = 28.7, MAX = 152.0; values from Arnocky et al.29).

Data analysis

The following variables were log-natural transformed to address skew and increase normality: salivary T, time on T therapy, and fo. A univariate analysis of variance (ANOVA) was used to compare groups (transmen, cismen, and ciswomen) for each of the speech parameters (fo, fo-SD, and VTL). Initially, age was added to the model as a covariate and was then removed because it did not affect the analysis outcome. Linear regressions were used to examine associations between salivary T and speech parameters in transmen.