Exploring racial and gender disparities in voice biometrics

Systemic inequity in biometric systems arising from racial and gender disparities has received considerable attention recently. These disparities have been explored in existing biometric systems such as facial biometrics (identifying individuals based on facial attributes). However, such ethical issues remain largely unexplored in voice biometric systems, which are popular and extensively used globally. Using a corpus of non-speech voice records featuring a diverse group of 300 speakers by race (75 each from White, Black, Asian, and Latinx subgroups) and gender (150 each from female and male subgroups), we show that racial subgroups have similar voice characteristics, whereas gender subgroups have significantly different voice characteristics. Moreover, by analyzing the performance of one commercial product and five research products, we find non-negligible racial and gender disparities in speaker identification accuracy. The average accuracy for Latinxs can be 12% lower than for Whites (p < 0.05, 95% CI 1.58%, 14.15%), and can be significantly higher for female speakers than for male speakers (3.67% higher, p < 0.05, 95% CI 1.23%, 11.57%). We further find that racial disparities result primarily from the neural network-based feature extraction within the voice biometric product, whereas gender disparities result primarily from both inherent differences in voice characteristics and neural network-based feature extraction. Finally, we point out strategies (e.g., feature extraction optimization) to incorporate fairness and inclusiveness considerations into biometric technology.

Demographic inequity, such as racial and gender disparities in biometric systems, has received significant attention in recent years. There are rising concerns about whether a biometric system performs significantly differently across subgroups, thereby privileging some subgroups and disadvantaging others. Previous studies have shown that such disparities exist in facial biometrics 1 . In contrast, racial and gender disparities remain unexplored for voice biometrics. Voice biometrics is a technology that uses the recognition of voice patterns to identify individuals. It is extensively used in critical biometric systems worldwide, in applications related to public services such as online banking 2 , access control 3,4 , healthcare 5 , and smart home technologies 6 . As a practical behavioral biometric modality, voice biometrics offers many benefits in terms of security, user-friendliness, low cost, and high social acceptance.
However, given the increasing concerns about potential demographic biases in biometrics in general, it is critical to examine whether racial and gender disparities exist in voice biometrics as well and, if so, to what extent. Previous explorations have demonstrated that disparities exist in other voice-based systems such as automatic speech recognition 7 . Given that racial and gender differences have been documented in inherent voice characteristics 8 , these differences may affect the performance of voice biometrics on users with different demographic backgrounds. Such racial or gender disparities can result in serious bias issues or other social problems when voice-based systems are deployed on a large scale. Therefore, we aim to explore whether racial and gender subgroups have different inherent voice characteristics and whether these differences cause disparities in voice biometric performance, as shown in Fig. 1. To achieve this goal, there are two main challenges. (1) What are the differences in inherent voice characteristics among racial and gender subgroups, and how can these differences be revealed?
To evaluate the inherent voice characteristics under demographic factors (race and gender), we investigate the essential voice properties for each race and gender in our matched datasets using 15 representative fundamental voice metrics, including formant frequencies 9 , Mel Frequency Cepstral Coefficients (MFCC) 10 , pitch onsets 11 , Root Mean Square (RMS) 12 , roll-off 13 , centroid 14 , spectral entropy 15 , PDF entropy 16 , permutation entropy 17 , and SVD entropy 18 . These fundamental metrics represent the essential and primary characteristics of the voice and also form the basis of voice biometric systems (see details in the "Voice fundamental metrics" section). Additionally, a matched dataset is one in which the data samples are paired so that speakers in different subgroups share similar characteristics except for the one factor under investigation, which controls Scientific Reports | (2022) 12:3723 | https://doi.org/10.1038/s41598-022-06673-y www.nature.com/scientificreports/ analysis" section). Subsequently, our system, combined with the matched datasets and statistical analysis protocols, can be used as a tool to evaluate the fairness of various voice biometric products.

Methods
In this section, we describe our matched datasets, statistical analysis methods, voice characteristic measurements, and voice biometric models. All methods and experimental protocols were carried out in accordance with relevant guidelines and regulations and were approved by the University at Buffalo Institutional Review Boards (IRB). Informed consent was obtained from all subjects or their legal guardian(s).
Matched dataset. The data used in this work is a subset of mPower, a clinical observational study conducted purely through a smartphone app interface 20 . The voice recording methodology closely resembles practical conditions in voice biometrics. The vocal data contain many audio recordings of participants saying 'Aaaaah' for 10 s (hereafter called snippets). The data are labeled with demographic information such as race and gender. To ensure data quality, we manually checked each recorded snippet and eliminated snippets with excessive background noise, with content other than the required text, or with insufficient length.
To better reflect the situation in the real world, we include both healthy participants and participants with diseases. For some diseases, there is strong evidence that they are not correlated with the human voice (e.g., with the vocal cords, vocal tract, and articulation). Thus, we label as unhealthy only participants with vocal, bronchial, or lung diseases that potentially affect the voice (e.g., asthma, pneumonia, bronchitis). We also set the average loudness of all audio data to −25 dB to keep the recording quality consistent across multiple types of mobile devices. To explore the racial and gender disparities in voice biometrics, we construct two matched datasets, one on race and one on gender. In the racial dataset, there are four subgroups, and 75 speakers with 512 snippets are collected for each subgroup.
Statistical analysis. We quantitatively assess the voice biometric systems by analyzing speaker identification performance, primarily in terms of Top-1 identification accuracy. Top-1 accuracy is the rate at which the candidate with the highest predicted probability is the genuine speaker. It is a standard measure of the capability of a voice biometric system and reflects the actual performance of speaker identification in real-world applications 22 . Formally, Top-1 accuracy is defined as Acc = m/N, where m and N denote the number of correct identifications and the total number of predictions, respectively. A higher Top-1 accuracy indicates better speaker identification performance. The results are shown in boxplot format, where the red line represents the median, the top edge of the box is the 75th percentile, and the bottom edge of the box is the 25th percentile. The standard deviation (STD) is also reported to further reflect the real performance of these biometric models 32 ; a lower STD represents more stable speaker identification.
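As an illustration of the Top-1 accuracy definition above (Acc = m/N), the following minimal Python sketch counts the trials in which the highest-scoring candidate is the genuine speaker, then summarizes per-fold accuracies by mean and STD. The function name `top1_accuracy`, the speaker identifiers, and all numbers are hypothetical and are not taken from the study; the use of the sample standard deviation is likewise an assumption, since the text does not specify the estimator.

```python
from statistics import mean, stdev


def top1_accuracy(scores, true_speakers):
    """Return Acc = m / N: the share of trials whose top-scoring speaker is genuine."""
    if not scores or len(scores) != len(true_speakers):
        raise ValueError("scores and true_speakers must be non-empty and equally long")
    correct = 0
    for trial_scores, genuine in zip(scores, true_speakers):
        # The predicted speaker is the enrolled identity with the highest score.
        predicted = max(trial_scores, key=trial_scores.get)
        if predicted == genuine:
            correct += 1
    return correct / len(scores)


# Hypothetical classifier outputs: one dict (speaker id -> probability) per trial.
scores = [
    {"spk_a": 0.7, "spk_b": 0.2, "spk_c": 0.1},
    {"spk_a": 0.1, "spk_b": 0.6, "spk_c": 0.3},
    {"spk_a": 0.5, "spk_b": 0.3, "spk_c": 0.2},  # wrong: genuine speaker is spk_c
    {"spk_a": 0.2, "spk_b": 0.3, "spk_c": 0.5},
]
true_speakers = ["spk_a", "spk_b", "spk_c", "spk_c"]
acc = top1_accuracy(scores, true_speakers)

# Hypothetical per-fold accuracies, summarized by mean and sample STD (assumption).
fold_accuracies = [0.96, 0.92, 0.94, 0.90, 0.93]
fold_mean = mean(fold_accuracies)
fold_std = stdev(fold_accuracies)
print(f"Top-1 accuracy: {acc:.2f}; folds: mean={fold_mean:.3f}, STD={fold_std:.3f}")
```

Representing each trial as a mapping from enrolled speaker to score keeps the sketch independent of any particular model; a real system would produce these scores from its classifier output.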
Besides, in this work, a voice biometric model is considered to exhibit a disparity if significant differences exist between the speaker identification performance of the subgroups, meaning that it privileges or disadvantages specific subgroups. Such a disparity also indicates that the voice biometric product contains bias. Here, we employ significance tests with one-way ANOVA and the Kruskal-Wallis test (when the data are non-normal and have unequal variances) 33 . The outcome of the significance test is the p-value based on the dispersion correlation, which is the probability of obtaining test results at least as extreme as the results observed, assuming that the null hypothesis is correct 34 . The p-value ranges from 0 to 1, and a p-value less than 0.05 indicates a significant difference among the tested data. Moreover, the 95% confidence interval (95% CI) for the true mean difference is used to reflect the disparity between subgroups 33 . It is a range of values that is likely to include the true mean difference between subgroups with 95% confidence. It indicates a significant difference between subgroups when
Onset represents the number of peaks from the onset strength envelope; (c) Roll-off is the center frequency of a spectrogram bin such that at least 85% of the energy is contained in this bin and the bins below; (d) Frequencies of formants (F0, F1, F2), which are local maxima of the spectrum that represent the acoustic resonances of the human vocal tract 9 .
(iii) Cepstral: MFCCs are commonly used as features in speech recognition and identification 38 . They concisely describe the overall shape of a spectral envelope. We also analyze the first (△) and second (△2) derivatives, which are temporal differentials of the MFCCs that represent their rate of change.
(iv) Voice entropy is also employed. It is a measure of the information capacity of a voice signal (that is, the maximum amount of information a voice can contain) and is widely considered a fundamental basis of voice biometrics 16 . Four representative biometric entropy metrics are utilized: spectral entropy, PDF entropy, approximate entropy, and perm entropy. These voice entropy metrics reflect the effective information capacity by measuring different intrinsic properties of the human voice. (a) The spectral entropy 15
Table 2, which is publicly accessible 39 .
(i) The speaker recognition service in Microsoft Azure 24 is a commercial cloud computing service that determines a speaker's identity from within a group of enrolled speakers. It is a publicly accessible and mature commercial product 40 .
(ii) 1d-CNN is short for one-dimensional convolutional neural network. 1d-CNN takes waveform data as input and uses Mel-spectrogram-32 as the base feature, to which eight one-dimensional trainable convolutional layers are added for feature extraction.
22 , rather than based on a fixed network structure. AutoSpeech uses the pre-processed Spectrogram-257 as the base feature. Several convolutional layers are added on top of the base feature. The structure of the feature extraction depends on the result of the optimization search.
The open-source research products, working under their default settings, are trained and tested on our matched dataset. For both the racial and the gender exploration, we train these products with all subgroups based on the matched dataset (detailed in the "Matched dataset" section). This mirrors the application scenarios of mainstream voice biometric products in the real world. We perform 5-fold validation for each product. The training-testing ratio of splitting data for each fold is 7:2.
The assessments based on these products run on a workstation with a Linux system (Ubuntu 16.04), an Nvidia Titan XP graphics card, and an Nvidia RTX 2080 graphics card (CUDA version 10.1). We use k-fold cross-validation for evaluation. The results from the individual folds are averaged, and the standard deviation is calculated for each trial. Since each speaker has an average of five snippets, we set k = 5 in this work. The advantage of this approach is that it reduces the effect of anomalous test data on the results and allows all data to be used for both training and testing, reflecting the actual conditions.
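The k-fold procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the function `speaker_kfold`, the snippet identifiers, and the choice to spread each speaker's snippets across the folds (so that every speaker is present in both training and testing, as closed-set identification requires) are all hypothetical. A plain k = 5 partition yields a 4:1 train/test split per fold, so the sketch does not reproduce the 7:2 ratio mentioned earlier.

```python
import random
from collections import defaultdict


def speaker_kfold(snippets, k=5, seed=0):
    """Yield k (train, test) splits; each speaker's snippets are spread over the folds."""
    by_speaker = defaultdict(list)
    for snippet_id, speaker in snippets:
        by_speaker[speaker].append(snippet_id)

    rng = random.Random(seed)  # fixed seed so the partition is reproducible
    folds = [[] for _ in range(k)]
    for speaker in sorted(by_speaker):
        ids = by_speaker[speaker][:]
        rng.shuffle(ids)
        # Deal this speaker's snippets round-robin into the k folds.
        for i, snippet_id in enumerate(ids):
            folds[i % k].append((snippet_id, speaker))

    for held_out in range(k):
        test = folds[held_out]
        train = [item for j, fold in enumerate(folds) if j != held_out for item in fold]
        yield train, test


# Hypothetical toy corpus: three speakers with five snippets each.
snippets = [(f"{spk}_{i}", spk) for spk in ("spk_a", "spk_b", "spk_c") for i in range(5)]
splits = list(speaker_kfold(snippets, k=5))
for fold_index, (train, test) in enumerate(splits):
    print(fold_index, len(train), len(test))
```

Per-fold accuracies obtained from such splits would then be averaged and their standard deviation reported, as the text describes.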

Results
As mentioned in the "Matched dataset" section, both the race and gender datasets used in the results are matched to remove the effect of unbalanced training data.
Voice characteristics analysis. We examine the human voice itself through a set of voice fundamental metrics. Fifteen representative voice fundamental metrics are used to measure the voice properties in our matched dataset. These voice metrics reflect different natural properties of the human voice, as described in the "Voice fundamental metrics" section. The box plots of voice measurements for all subgroups are shown in Fig. 3.
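To make two of these metrics concrete, the sketch below computes a spectral roll-off (the frequency of the bin at which 85% of the energy has accumulated) and a spectral entropy for a toy spectrum. The function names, the toy values, and the entropy formulation (Shannon entropy of the normalized spectrum, scaled to the range 0 to 1) are assumptions made for illustration; the exact definitions and implementations used in the study may differ.

```python
import math


def spectral_rolloff(freqs, energy, roll_percent=0.85):
    """Return the first bin frequency at which cumulative energy reaches roll_percent."""
    total = sum(energy)
    if total <= 0:
        raise ValueError("spectrum must contain positive energy")
    threshold = roll_percent * total
    cumulative = 0.0
    for freq, value in zip(freqs, energy):
        cumulative += value
        if cumulative >= threshold:
            return freq
    return freqs[-1]


def spectral_entropy(power, normalize=True):
    """Shannon entropy (bits) of the spectrum treated as a probability distribution."""
    total = sum(power)
    if total <= 0:
        raise ValueError("spectrum must contain positive energy")
    probs = [p / total for p in power if p > 0]
    entropy = -sum(p * math.log2(p) for p in probs)
    if normalize and len(power) > 1:
        entropy /= math.log2(len(power))  # 1.0 for a flat spectrum
    return abs(entropy)  # abs() only guards against a signed zero


# Hypothetical one-frame spectrum with eight bins (values are illustrative only).
freqs = [0, 100, 200, 300, 400, 500, 600, 700]
energy = [1, 4, 8, 5, 1, 0.5, 0.25, 0.25]
rolloff = spectral_rolloff(freqs, energy)
entropy = spectral_entropy(energy)
print(f"roll-off: {rolloff} Hz, normalized spectral entropy: {entropy:.3f}")
```

A flat spectrum gives a normalized entropy of 1, and a spectrum concentrated in a single bin gives 0, which is why entropy is read as a measure of how much information the spectrum can carry.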
First, we perform the testing on the racial dataset. The results are shown in Table 3. There are no meaningful differences among most of these voice metrics, except for F0, F1, F2, PDF entropy, and Perm entropy.
This means that the voice properties of the racial groups are generally similar and adequate for voice biometrics from the standpoint of natural voice properties, although slight differences exist. In detail, however, there are always performance gaps between these subgroups in some aspects.
Table 4. Significant differences between male and female subgroups exist in various metrics. Thereby, we find that significant differences in voice characteristics exist between gender subgroups, but only slight differences exist between racial subgroups.
Racial disparity in speaker identification performance. After investigating the voice characteristics of the race and gender subgroups, we explore whether speaker identification performance is inseparably related to these voice characteristics. We start by computing the Top-1 accuracy for speaker identification across our matched audio snippets within the racial dataset. For the commercial product, we employ the commercial voice biometric model from Microsoft Azure, which is a mature voice biometric product and can work on non-speech voice. It is worth mentioning that other technology companies (e.g., Apple, IBM, Google, Amazon, Facebook) do not have publicly accessible commercial voice biometrics or speaker recognition products for non-speech audio. We also note that, because the speaker recognition service of Microsoft Azure (Version 1.14.0, March 2021) is limited to 24 users, we randomly select 24 different speakers (12 females and 12 males) from the matched dataset when evaluating the racial disparity, and 24 different speakers (six Whites, six Blacks, six Latinxs, six Asians) when examining the gender disparity. Both of these selected datasets are matched. The other open-source biometric models use the complete matched datasets. The identification results
For the open-source products, we build the speaker identification systems from scratch under the default settings from their original papers, based on our matched dataset. There are mainly two types of voice biometric products. Besides the neural network-based type discussed below, the statistics-based type (e.g., i-Vector and GMM-UBM) primarily utilizes the phonemes (pitch, cadence, and inflection) for speaker identification. However, the 'Aaaaah' utterances are too short to meet the requirements of statistics-based voice biometric products.
The statistics-based voice biometric products need at least 20-80 unique phonemes with a duration of 1-2 min, a requirement that the current dataset does not meet 21 . For five state-of-the-art neural network-based voice biometric products, Fig. 4 shows that the identification performance differs across subgroups in all of these products. For example, for the ResNet-34 model, which has the best overall performance, the Top-1 accuracy for White speakers is 97.33% (STD 4.09%), compared with 94.67% (STD 3.62%) for Asians, 90.67% (STD 4.91%) for Blacks, and 88.00% (STD 2.34%) for Latinxs. However, for the AutoSpeech model, which also has excellent overall performance, the Black subgroup (90.67%, STD 2.42%) performs better than the Asian (88.00%, STD 3.71%) and Latinx (85.33%, STD 2.83%) subgroups, although the White subgroup still achieves the best performance with 97.33% (STD 3.23%). The results illustrate that no particular racial group has the best performance over others among all these speaker identification models, and no specific racial group always has the worst performance. Besides, significant performance gaps are uncovered between these subgroups. In the CNN model, the performances of all subgroups are significantly different in general (p = 0.02), and the White and Black subgroups perform significantly better than the Latinx subgroup (p = 0.01, 95% CI 1.58, 14.15; p < 0.01, 95% CI 2.25, 14.81, respectively). In the AutoSpeech model, a significant difference exists among all subgroups on the whole (p = 0.02), and the White and Black subgroups perform significantly better than the Latinx subgroup (p = 0.03, 95% CI 1.95, 7.66; p = 0.02, 95% CI 1.12, 11.28, respectively). In the ResNet-18 model, the White subgroup performs significantly better than the Latinx subgroup (p = 0.02, 95% CI 1.16, 13.50). These results indicate that both commercial and open-source voice biometric models exhibit disparities among these racial subgroups.
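Subgroup comparisons of this kind rest on the significance tests named in the "Statistical analysis" paragraph. As a rough sketch of the non-parametric option, the code below computes the Kruskal-Wallis H statistic (with the standard tie correction) and compares it against 7.815, the chi-square critical value for 3 degrees of freedom at the 0.05 level. The function name and all accuracy values are hypothetical and are not the study's data; the chi-square comparison is only an approximation for samples this small, and a statistics package would normally be used to obtain an exact p-value.

```python
def kruskal_wallis_h(*groups):
    """Return the tie-corrected Kruskal-Wallis H statistic for two or more groups."""
    pooled = sorted((value, g) for g, group in enumerate(groups) for value in group)
    n = len(pooled)
    rank_sums = [0.0] * len(groups)
    tie_term = 0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        # Tied values share the average of ranks i+1 .. j (ranks are 1-based).
        average_rank = (i + 1 + j) / 2
        for _, g in pooled[i:j]:
            rank_sums[g] += average_rank
        ties = j - i
        tie_term += ties ** 3 - ties
        i = j
    h = 12 / (n * (n + 1)) * sum(
        r * r / len(group) for r, group in zip(rank_sums, groups)
    ) - 3 * (n + 1)
    correction = 1 - tie_term / (n ** 3 - n)
    if correction == 0:
        raise ValueError("all observations are identical")
    return h / correction


# Hypothetical per-fold Top-1 accuracies (%) for four subgroups; not the paper's data.
subgroup_1 = [97.3, 96.0, 98.7, 97.0, 96.5]
subgroup_2 = [94.7, 93.3, 95.0, 94.0, 95.5]
subgroup_3 = [90.7, 92.0, 89.3, 91.0, 90.0]
subgroup_4 = [88.0, 86.7, 88.5, 87.0, 85.3]

h_stat = kruskal_wallis_h(subgroup_1, subgroup_2, subgroup_3, subgroup_4)
CHI2_CRITICAL_DF3 = 7.815  # chi-square critical value, df = 3, alpha = 0.05
significant = h_stat > CHI2_CRITICAL_DF3
print(f"H = {h_stat:.3f}, significant at 0.05: {significant}")
```

A significant omnibus result only says that at least one subgroup differs; pairwise comparisons, such as the White versus Latinx contrasts reported above, require follow-up tests.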
Gender disparity in speaker identification performance. We continue by exploring the disparity on the matched gender dataset. For the commercial product from Microsoft Azure, the average Top-1 accuracy is 73.33% (STD 8.64%) for male speakers and 58.33% (STD 6.59%) for female speakers. The performance for male speakers is significantly higher than for female speakers (p = 0.01, 95% CI −26.20, −3.80), demonstrating that gender disparity exists in the Microsoft Azure speaker recognition service. Besides, taken together with 7 , this result further reveals that the racial disparity widely exists in almost all of the speech-related cognitive services on Microsoft Azure as a result of unbalanced training data samples on the whole platform. For state-of-the-art open-source voice biometric products, Fig. 5 shows that, among all these speaker identification models, the performance for female speakers is better than for male speakers. For example, for the ResNet-34 model with the best overall performance, the average Top-1 accuracy for female speakers is 92.00% (STD 2.97%) compared with 90.67% (STD 3.94%) for male speakers. Also, in the AutoSpeech model with excellent overall performance, the female subgroup achieves the better performance with 89.00% (STD 2.18%), which contrasts with 85.34% (STD 4.71%)

Result analysis and discussion
Study on causal factors. Two representative categories of causes could account for these racial and gender disparities in the voice biometrics domain: (i) the voice characteristic cause: since different races or genders produce voices with different properties, it is natural to wonder whether the general natural properties (e.g., phonation) of the human voice itself limit speaker identification performance; and (ii) the technical cause: there are two main components in a voice biometric system, feature extraction and classification. Feature extraction extracts specific characteristics from the original voice snippets, and classification verifies the user identity based on these learned characteristics. Thus, another important concern is whether the technology in the voice biometric model hinders individual identification (e.g., through limited feature selection) 42,43 . The results in the "Voice characteristics analysis" section indicate that there is a slight difference between the voices of racial subgroups (e.g., in F1). Moreover, there are significant differences between gender subgroups among the 15 voice fundamental metrics. Therefore, we investigate causal factors for race and gender separately.
Racial causal factors. The results in the "Voice characteristics analysis" section indicate that the racial disparities in voice biometrics do not stem predominantly from the voice itself, since most of the voice fundamental measures do not differ between racial groups. We further scrutinize the technical cause, the biometric technology itself. Due to constraints on computing time and resources, these features can only reflect some principal properties of the human voice (not all properties), which can unwittingly amplify the racial disparities in the final voice biometrics outcomes.
Gender causal factors. Since there are significant differences in voice characteristics between the gender subgroups, we hypothesize that the gender disparities are primarily caused by both inherent voice properties and limited feature extraction. As listed in Table 1, a set of critical voice fundamental metrics/features is used to measure voice properties from different perspectives (e.g., temporal, spectral, and cepstral), which can reflect the voice characteristics of the speaker and help us interpret the results of the voice biometric products. To understand the voice features utilized by voice biometric products, we first explore the relation between vocal biological structures and the dominant inherent voice characteristics used for speaker identification, as shown in Fig. 6. There are two levels of inherent voice characteristics/properties utilized in voice biometric products. The first level is based on L1 voice properties, the general characteristics of the voice (e.g., phonation). Phonation is the process by which the vocal folds produce certain sounds through quasi-periodic vibration, which also depends on the activity of the muscles and the position of the cartilages of the larynx 44 . The second level utilizes L2 voice properties, the minutiae points (e.g., formant frequency, formant locations). A formant is a distinctive frequency component of the acoustic signal and is usually defined as a broad peak, or local maximum, in the spectrum. The formants are largely determined by the length of the vocal tract and vocal folds. We can assess the acoustic resonance of the vocal tract by searching for spectral peaks in the sound spectrum. The formant with the lowest frequency is called F1, and the second is called F2. Most often, the two formants F1 and F2 are fundamental and crucial characteristics of the human voice, including both non-speech and speech voice 45,46 .
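The idea of locating formants by searching for spectral peaks can be illustrated with a simple local-maximum search over a smoothed spectral envelope. This is only a toy sketch: the function `spectral_peaks`, the synthetic two-bump envelope, and the peak positions of 700 Hz and 1200 Hz are illustrative assumptions rather than measurements from the study, and practical formant estimation usually relies on more robust techniques such as linear predictive coding.

```python
import math


def spectral_peaks(freqs, envelope, min_height=0.0):
    """Return (frequency, amplitude) pairs at strict local maxima of the envelope."""
    peaks = []
    for i in range(1, len(envelope) - 1):
        is_local_max = envelope[i] > envelope[i - 1] and envelope[i] > envelope[i + 1]
        if is_local_max and envelope[i] >= min_height:
            peaks.append((freqs[i], envelope[i]))
    return peaks


# Synthetic envelope: two Gaussian resonances on a 50 Hz grid (illustrative values).
freqs = list(range(0, 3000, 50))
envelope = [
    math.exp(-((f - 700) / 150) ** 2) + 0.6 * math.exp(-((f - 1200) / 150) ** 2)
    for f in freqs
]

peaks = spectral_peaks(freqs, envelope, min_height=0.1)
# Following the text: the lowest-frequency peak is F1 and the next one is F2.
f1, f2 = peaks[0][0], peaks[1][0]
print(f"F1 = {f1} Hz, F2 = {f2} Hz")
```

If F1 sits lower or is less pronounced for some speakers, a fixed peak search like this one has a harder time isolating it, which gives some intuition for why F1 differences matter to a feature extractor.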

Disparities sources detection. As shown in
As discussed in the "Voice characteristics analysis" section, for racial subgroups, the principal differences in voice properties result from phonation/formants. Latinx vowels (including /ɑ/) are generally shorter in duration than those of other subgroups, vary little in quality, and remain contrastive in stressed and unstressed positions 47 . Moreover, Latinx speakers have lower F1 during isolated /ɑ/ prolongations compared with White speakers. Besides, for gender subgroups, male speakers have longer vocal tract and vocal fold dimensions and lower formants concentrated within the low-frequency band of the spectrum 48,49 . After examining this taxonomy for voice biometrics, we continue by investigating how the technology hinders individual identification.
To further disclose the source of these racial and gender disparities, we examine the learned voice characteristics/properties in the feature extraction outputs of these products. Feature extraction usually includes the base features and neural network-based feature maps. Considering that ResNet-18, ResNet-34, and AutoSpeech use the same classifier and base feature (spectrogram) but show different subgroup preferences on our matched datasets, as mentioned above, the voice characteristics (feature weights) learned by these three models are shown in Fig. 7. The results show that the learned features within these three biometric models mostly weigh the voice properties related to formants 50 . In the racial group, Latinx speakers have lower F1 during isolated /ɑ/ prolongations than White speakers 51 , making it more difficult for these convolution layer-based feature extractors to locate the F1 band. Consequently, feature extraction is limited in extracting useful voice characteristics from Latinx speakers in both the ResNet-18 and AutoSpeech models, which jeopardizes the final speaker identification decision and causes racial disparity. Besides, in the gender group, males' formants are mainly located in the low-frequency area, and consequently the texture of males' spectrograms repeats more irregularly than that of females'. Since the classic convolutional kernel used in these products is less effective in generalizing such irregular patterns due to shape mismatch 52 , the neural network-based feature extraction is restricted in further uncovering the voice identity in these three models 53,54 . Moreover, similar situations can be observed in the remaining research voice biometric models.
Thereby, we further find that racial disparities result primarily from the neural network-based feature extraction within the voice biometric product, whereas gender disparities result primarily from both inherent voice characteristic differences and neural network-based feature extraction.
Disparities discussion. As noted above, our findings indicate that racial subgroups overall differ only slightly in inherent voice characteristics (e.g., in F1), whereas differences between gender subgroups are substantial. Disparities exist between both racial and gender groups in several biometric products (e.g., ResNet-18, AutoSpeech), disadvantaging particular subgroups (e.g., Latinxs). We find that racial disparities in voice biometrics are not primarily related to voice characteristics but stem from a technical cause, namely a gap in feature extraction. On the other hand, gender disparities are primarily related to both inherent voice characteristics and the feature extraction technology. The results indicate that neural network-based feature extraction is, to some extent, limited in learning comprehensive voice characteristics for voice biometrics 55 .
Currently, AutoSpeech is widely recognized as achieving the highest speaker identification performance among the open-source research products (as of July 2021, AutoSpeech achieves the best speaker identification performance on VoxCeleb1, as verified by Papers with Code), but it has perceptible racial and gender disparities. Our findings reveal that, when designing a voice biometric product, rather than focusing only on the overall performance on a representative curated voice dataset (e.g., VoxCeleb1 56 ), we also need to pay attention to the subgroups' performance. Besides, to improve speaker identification performance or mitigate disparities, feature extraction optimization is also an option 57,58 , beyond working on the classifier alone. Deep features are high-level features that are automatically learned by the deep neural network from the data over several iterations. Understanding and interpreting deep features is still a challenge, so manual intervention to avoid model bias toward demographic backgrounds is very difficult. Therefore, to overcome this problem, our system can be used as a tool for evaluating voice biometric products, quantifying the fairness of a voice biometric model through matched datasets. Moreover, it can provide indications for multi-model fusion to reduce bias in voice biometric products.
In our study, the speakers in our matched datasets range in age from 15 to 70 years in each subgroup, and most speakers are aged 20-40 years. It is therefore possible that at least some of the differences we observe mainly reflect the 20-40 age group rather than all ages. This does not invalidate the discovery of racial and gender disparities in voice biometric models. We hope to extend this work in the future by examining voice biometric performance among speakers from other age groups.
Furthermore, it is time for researchers, engineers, investors, and policymakers in this area to rethink this technology comprehensively to ensure that it has a low likelihood of causing harm or bias toward particular subgroups. Another problem that deserves attention is preventing such disparities from affecting prevailing cultural and social norms or legal regulations, and avoiding secondary victimization.

Data availability
The data that support the findings of this study are available from the corresponding author upon reasonable request. The data are not publicly available because they contain information that could compromise research participant privacy/consent.
Received: 17 July 2021; Accepted: 1 February 2022

Figure 7. The voice biometric system is mainly divided into two parts: feature extraction and classification. The feature extraction part includes the base feature and neural network-based feature maps. ResNet-18, ResNet-34, and AutoSpeech utilize the same classifier and base feature (spectrogram). The racial disparity is discovered in ResNet-18 and AutoSpeech, and gender disparity is detected in all three. The neural network-based feature extraction primarily causes these disparities. F1 denotes the first formant here.