VSUGAN: unifying voice style based on spectrograms and generative adversarial networks

In course recording, audio recorded with different pickups and in different environments can be clearly distinguished after splicing, and the resulting style differences degrade the quality of recorded courses. A common way to address this problem is voice style unification. In the present study, we propose a voice style unification model based on generative adversarial networks (VSUGAN) that transfers voice style via the spectrogram. VSUGAN synthesizes audio by combining the style information from an audio style template with the voice information from the processed audio, and it unifies audio styles across different environments without retraining the network for new speakers. VSUGAN is implemented and evaluated on the THCHS-30 and VCTK-Corpus corpora. The source code of VSUGAN is available at https://github.com/oy-tj/VSUGAN. In summary, it is demonstrated that VSUGAN can effectively improve the quality of recorded audio and reduce style differences across various environments.

In recent years, popular online courses such as SPOCs and the application of flipped classrooms have created a large demand for regular course recording 1 . Audio recorded with different pickups and in different recording environments may contain additional background noise, which can be clearly distinguished by human ears after splicing 2 and leads to uneven sound quality that affects the recorded courses. Traditionally, this problem is addressed by manually adjusting the waveform or frequency spectrum during post-production, or by removing noise with denoising algorithms 3,4 . However, the post-production involved in regular course recording is usually unprofessional, and manual adjustment is time-consuming. Traditional denoising algorithms, such as spectral subtraction 5 , subspace 6 , statistical-model-based 7 , and Wiener algorithms 8 , remove only part of the background noise and cannot solve the problem completely. Meanwhile, neural networks for speech enhancement, such as SEGAN 9,10 , mostly focus on obtaining clean speech and fail to unify audio styles across different environments.
Voice style unification, also known as voice style transfer, refers to combining a speaker's timbre, paralanguage (mood and intonation), and other characteristics into synthesized audio. Through decades of development, voice style transfer has achieved many advances driven by applications of voice conversion technology. For instance, Valbret et al. proposed a method based on Pitch Synchronous Overlap and Add technology to realize voice transformation 11 , and Desai et al. used a BP neural network to achieve speech conversion 12 . Thanks to the development of deep learning, especially long short-term memory networks, performance has improved significantly 13 . Moreover, to further enhance the quality of voice conversion, Donahue et al. presented WaveGAN based on deep convolutional generative adversarial networks (DCGAN) 14 . However, existing generative adversarial networks (GANs) only handle fixed one-to-one or many-to-many voice conversion scenarios. Once new speakers are involved, the GAN must be retrained with transfer learning, which is repetitive and costly.
In the current study, a voice style unification model based on generative adversarial networks (VSUGAN) is established to unify voice style across different environments without retraining the network for new speakers. VSUGAN combines the style information from an audio style template with the voice information from the processed audio. In this method, background noise is also considered part of the audio style. The input consists of an audio style template and noise-mixed audio, while the output is target-style audio.

Spectrogram and voiceprint. The spectrogram is obtained by the short-time Fourier transform (STFT) of the voice signal. The voice signal, i.e., the waveform signal, is first divided into a number of overlapping frames according to the time window and then converted to a frequency spectrum frame by frame via the fast Fourier transform (FFT). Next, the frequency spectra are arranged in frame order to form a spectrogram 15 . The x-axis of the spectrogram denotes time, and the y-axis stands for frequency. The amplitude of a particular frequency at a specific time is represented by color, where dark colors correspond to low amplitudes and brighter colors to progressively stronger amplitudes. The change of background noise and voiceprint after a piece of audio passes through VSUGAN can be clearly observed in the spectrogram. In the experiment, the librosa library was utilized to convert the waveform signals to spectrograms with an FFT window size of 512. Meanwhile, voice waveforms and spectrograms can reflect the different recording effects of diverse recording environments. Figure 1a shows the waveform and spectrogram of a piece of audio recorded in four different environments; the corresponding recording settings are shown in Table 1. It can be seen from Fig. 1a that the recording effects clearly differ across the four environments.

Background noise. Background noise in different environments generates noise points with various distributions and shapes on the spectrogram 16 . Figure 1b shows the waveforms and spectrograms of the same audio after overlaying different noises, where b1 is the original clean audio, b2 is mixed with cafeteria noise, b3 with driving-car noise, and b4 with Gaussian white noise. It is clear from b2-2, b3-2, and b4-2 that the energy of the cafeteria and car noise is concentrated in the low-frequency region, while the Gaussian white noise is distributed across the whole frequency range.
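The spectrogram dimensions used later in the paper (257 × 513) follow directly from this pipeline: a 4-s segment at 16,384 Hz analyzed with a 512-point FFT window. A minimal NumPy sketch mirroring librosa's centered STFT (the hop length of 128 samples is an assumption; the text only states the window size):

```python
import numpy as np

def magnitude_spectrogram(wave, n_fft=512, hop=128):
    """Frame the padded signal, window it, and take the FFT frame by frame,
    as described above; rows are frequencies, columns are time frames."""
    wave = np.pad(wave, n_fft // 2)          # center-pad, as librosa does
    window = np.hanning(n_fft)
    n_frames = 1 + (len(wave) - n_fft) // hop
    frames = np.stack([wave[i * hop:i * hop + n_fft] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1)).T

segment = np.random.randn(4 * 16384)         # one 4-s speech segment
S = magnitude_spectrogram(segment)
print(S.shape)                               # (257, 513)
```

With these settings, 512-point frames yield 257 frequency bins and a 4-s segment yields 513 frames, matching the input size quoted for the network.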

Voice style unification based on GAN
Algorithm and system structure. A GAN consists of two parts: a generator G and a discriminator D . The generator G learns to map samples z from a prior distribution P z to the target distribution P G . However, P G is unknown, so the discriminator D is designed to judge the similarity between P G and the data distribution P data . The discriminator D is trained to distinguish samples x from P data from the fake samples generated by G . Conversely, the generator G is trained to make its output G(z) deceive the discriminator D as much as possible, so that D cannot tell whether the data comes from P data or P G . Alternately training D and G improves both until the data generated by G meets the requirements 17 . This adversarial training in the classical GAN can be described as:

min_G max_D V(D, G) = E_{x∼P_data}[log D(x)] + E_{z∼P_z}[log(1 − D(G(z)))]

Assuming an extra input x s is added to the classical GAN, the output G(z, x s ) of the generator can also acquire properties related to x s . Thus, the adversarial training with the addition of x s can be described as:

min_G max_D V(D, G) = E_{x∼P_data}[log D(x, x_s)] + E_{z∼P_z}[log(1 − D(G(z, x_s), x_s))]

To train VSUGAN more conveniently and stably, the Pearson χ² divergence 18 is applied to replace the KL divergence of the classical GAN:

min_D V(D) = (1/2) E_{x∼P_data}[(D(x, x_s) − 1)²] + (1/2) E_{z∼P_z}[D(G(z, x_s), x_s)²]
min_G V(G) = (1/2) E_{z∼P_z}[(D(G(z, x_s), x_s) − 1)²]

The workflow of the generator network G is shown in Fig. 2a; it has two encoders and one decoder. The original utterances are down-sampled to 16,384 Hz and sliced into segments with a length of 4 s. This segment length is chosen because it is difficult to extract enough style information from shorter speech. The input of each encoder is the 257 × 513 × 1 spectrogram obtained by the STFT of a voice segment. One encoder (the encoder for noise, termed n-Encoder) extracts the content information (Info Code) from the spectrogram of the a-1 audio, and the other (the encoder for the style template, s-Encoder for short) extracts the style information (Style Code) from the spectrogram of the a-2 audio.
Subsequently, the content and style information, i.e., the Info Code and Style Code, are combined and fed to the decoder. The decoder then outputs a spectrogram with a unified style, from which the target audio is generated through the inverse short-time Fourier transform (ISTFT). The Griffin-Lim algorithm 19 is used to generate the phase signal for the ISTFT.
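As a rough illustration of this phase-recovery step, the following sketch implements a minimal Griffin-Lim loop with SciPy's STFT/ISTFT pair; in practice a library routine such as librosa.griffinlim would be used, and the hop length and iteration count here are illustrative assumptions:

```python
import numpy as np
from scipy.signal import stft, istft

def griffin_lim(mag, n_fft=512, hop=128, n_iter=32, seed=0):
    """Alternate between the magnitude constraint and STFT consistency to
    estimate a phase for the magnitude spectrogram, then invert to audio."""
    rng = np.random.default_rng(seed)
    phase = np.exp(2j * np.pi * rng.random(mag.shape))   # random initial phase
    for _ in range(n_iter):
        _, wave = istft(mag * phase, nperseg=n_fft, noverlap=n_fft - hop)
        _, _, spec = stft(wave, nperseg=n_fft, noverlap=n_fft - hop)
        phase = np.exp(1j * np.angle(spec))              # keep phase, reset magnitude
    _, wave = istft(mag * phase, nperseg=n_fft, noverlap=n_fft - hop)
    return wave
```

Each iteration inverts the current magnitude-phase estimate to a waveform and re-analyzes it, so the phase gradually becomes consistent with the fixed magnitude spectrogram produced by the decoder.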
At the same time, Fig. 2b illustrates the discriminator workflow. The input of the discriminator is the combination of the generator's output spectrogram and the style template spectrogram; the output is a judgment of whether their styles are consistent.
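As described in the configurations that follow, each stride-2, 3 × 3 convolution maps a dimension h to (h + 1)/2, and the quoted feature-map sizes can be checked with a few lines:

```python
def after_convs(size, n_layers):
    """Apply the stride-2 size rule h -> (h + 1) // 2 n_layers times."""
    for _ in range(n_layers):
        size = (size + 1) // 2
    return size

# encoder: eight stride-2 units shrink 257 x 513 to 2 x 3
print(after_convs(257, 8), after_convs(513, 8))   # 2 3
# discriminator: six stride-2 layers shrink 257 x 513 to 5 x 9
print(after_convs(257, 6), after_convs(513, 6))   # 5 9
```

The intermediate sequence for the height is 257 → 129 → 65 → 33 → 17 → 9 → 5 → 3 → 2, which is why eight encoder units reach the 2 × 3 bottleneck while six discriminator layers stop at 5 × 9.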
Generator configuration. The configuration of the generator is illustrated in Fig. 3. The n-Encoder and s-Encoder, which share the same structure in the generator network, are designed to extract content and style information, respectively. Each compresses the 257 × 513 × 1 input spectrogram into 2 × 3 × 2048 encoded information through eight encoder units, each of which downsamples the feature map with a 3 × 3 convolution kernel with a stride of 2; no pooling layer is used, similar to DCGAN 20 . After each convolution, the data size changes from (h, w) to ((h + 1)/2, (w + 1)/2), and the convolved data is activated by the Rectified Linear Unit (ReLU) 21 . Mirroring the encoder, the decoder decodes the combined 2 × 3 × 4096 encoded information into a 257 × 513 × 1 spectrogram through eight decoder units, which also apply a 3 × 3 convolution kernel with a stride of 2 to upsample the feature map through fractionally strided convolutions. At the same time, skip connections concatenate the output of the previous decoder unit with the input of the corresponding encoder unit 9 . Skip connections thus reduce information loss and alleviate gradient explosion and gradient vanishing during training 22 .

Discriminator configuration. Figure 4 shows the configuration of the discriminator. The discriminator concatenates two 257 × 513 spectrograms into a 257 × 513 × 2 tensor and generates 5 × 9 × 1024 feature maps through six convolution layers with a 3 × 3 kernel and a stride of 2. The convolution results are normalized by batch normalization and activated by ReLU. The resulting feature map is then fed to a 5-layer fully connected network, and finally a score in [0, 1] is calculated to judge the style similarity of the two input voice segments.

VSUGAN training. Data preparation. The data set used in VSUGAN is constructed from the THCHS-30 corpus, an open-source Chinese speech corpus containing 13,388 Mandarin sentences from 60 speakers.
All the speech was recorded in a quiet office environment. In addition, the corpus provides three kinds of 0 dB noise (cafeteria noise, car noise, and white noise) 23 . First, the whole corpus is read and resampled to 16,384 Hz, and the speech and noise are segmented into 4-s pieces. All speech segments are divided by speaker into a training set and a testing set: the training set contains all the speech fragments of 51 randomly selected speakers, and the testing set includes the speech of the remaining nine speakers.
Let the set of all speakers be A , and the set of speech fragments of a speaker a be C a . During training, the spectrogram of a sample c a from C a is processed with an image morphology algorithm to obtain ∼ c a with a changed voiceprint. Here, ∼ c a is used to destroy the original voice details and simulate the voiceprint changes caused by different pickups or environments. The image morphology algorithm applied to the spectrogram of c a is randomly selected from the eight algorithms in Table 2.
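The eight morphology algorithms themselves are listed only in Table 2, but a typical such operation is grey-scale dilation, which thickens bright voiceprint ridges on the spectrogram. A pure-NumPy sketch (the 3 × 3 kernel size is an assumption for illustration):

```python
import numpy as np

def grey_dilate3(img):
    """3 x 3 grey-scale dilation: each pixel takes the maximum over its
    neighborhood, smearing voiceprint detail on a magnitude spectrogram."""
    h, w = img.shape
    p = np.pad(img, 1, mode='edge')
    stack = [p[i:i + h, j:j + w] for i in range(3) for j in range(3)]
    return np.max(stack, axis=0)
```

Applied to a spectrogram, such an operation destroys fine voice details while preserving the coarse time-frequency structure, which is exactly the degradation the training pipeline needs to simulate.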
Besides, a noise sample n from the noise set N is mixed proportionally with ∼ c a to generate z a , with the mixing proportion r = random(0, 0.3), where the set Z a of all z a is the mixed-noise audio set of speaker a.
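A sketch of this mixing step, assuming simple additive scaling of the 0 dB noise segment by the proportion r (the exact mixing formula is not spelled out in the text):

```python
import numpy as np

rng = np.random.default_rng(0)

def mix_noise(c_tilde, noise):
    """Overlay a noise segment on the processed speech ~c_a with a random
    proportion r in [0, 0.3), yielding one mixed-noise sample z_a."""
    r = rng.uniform(0.0, 0.3)
    return c_tilde + r * noise[: len(c_tilde)]

speech = rng.standard_normal(4 * 16384)   # one processed 4-s segment
noise = rng.standard_normal(4 * 16384)    # one 0 dB noise segment
z_a = mix_noise(speech, noise)
```

Drawing r afresh for every sample gives the network training pairs spanning the 0-30% noise-intensity range used during training.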
In addition to c a , another sample x s from C a is taken as the style template. Since c a and x s were recorded with the same pickup in the same environment for speaker a , it can be assumed that the styles of c a and x s are the same. Therefore, c a is used as a label to evaluate whether the style of the generator's output audio is similar to that of x s . This transforms the judgment of style consistency from an unsupervised learning problem into a supervised learning problem, which is crucial for training the discriminator.
Loss function. The loss function consists of two parts: an L1 loss that measures the loss of information by comparing the generator output G(z a , x s ) with c a , and a discriminator loss that measures the loss of style.

L1 loss. Given the mixed-noise audio z a , the clean audio c a , and the style template audio x s , the L1 loss is defined as:

L_GL1 = E[||G(z_a, x_s) − c_a||_1]

Discriminator loss. Given the mixed-noise audio z a , the clean audio c a , and the style template audio x s , the discriminator loss is:

L_D = (1/2) E[(D(c_a, x_s) − 1)²] + (1/2) E[D(G(z_a, x_s), x_s)²]

When training the generator, the output of the discriminator D is considered part of the generator loss to measure the loss of style:

L_GD = (1/2) E[(D(G(z_a, x_s), x_s) − 1)²]

Total loss. Combining the L1 loss and the discriminator loss above, the total loss of the generator is:

L_G = L_GL1 + K · L_GD

where K is a hyper-parameter controlling the weights of the two losses. Initially, K was set to 100, but it was observed that L GD was one order of magnitude lower than L GL1 ; with K set to 10, the two parts of the total loss achieve the best balance.
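Numerically, the interaction of the two terms can be sketched as follows; the least-squares form of the adversarial term is an assumption based on the Pearson χ² objective cited earlier, not a formula stated verbatim in the text:

```python
import numpy as np

def generator_loss(out_spec, clean_spec, d_scores, K=10.0):
    """Total generator loss: an L1 term against the clean spectrogram c_a
    plus the discriminator (style) term, weighted by the hyper-parameter K."""
    l_gl1 = np.mean(np.abs(out_spec - clean_spec))   # information loss
    l_gd = 0.5 * np.mean((d_scores - 1.0) ** 2)      # style loss (LSGAN form)
    return l_gl1 + K * l_gd

# a perfect style score (d = 1) leaves only the L1 term
loss = generator_loss(np.ones((4, 4)), np.zeros((4, 4)), np.array([1.0]))
print(loss)   # 1.0
```

With K = 10 the style term is scaled up by roughly the order of magnitude by which it trails the L1 term, which is the balancing behavior the paragraph above describes.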
Training. The Adam optimizer is used to train both the generator and the discriminator in VSUGAN. The learning rate of the generator is set to 0.001 and that of the discriminator to 0.0001. By default, beta1 and beta2 are set to 0.9 and 0.999, respectively. VSUGAN is trained for 44 epochs with a batch size of 5; training beyond 44 epochs results in overfitting.

Experimental results
Evaluation metrics. The signal-to-noise ratio (SNR) is an indicator of the amount of noise in the measured audio.
A higher SNR represents a lower amount of noise; SNR is the ratio of the signal power P signal to the noise power P noise . For a group of training or testing data, the mixed-noise audio z a , the clean audio c a , the style template audio x s , and the generator output out = G(z a , x s ) are considered. The clean audio c a is used as the signal, and the difference between the measured audio and c a is used as the noise:

SNR_out = 10 log10(||c_a||² / ||out − c_a||²)
SNR_za = 10 log10(||c_a||² / ||z_a − c_a||²)
ΔSNR = SNR_out − SNR_za

SNR out is the SNR of the generator output, SNR za is the SNR of z a , and ΔSNR is the difference between them. Style unification is more effective when ΔSNR is higher.
Mel cepstral distortion (MCD) is usually used in voice conversion tasks to measure the similarity between the target voice and the converted voice. Using Mel-cepstral coefficients (MCEP) 24 , MCD calculates the Euclidean distance 25 between the Mel cepstra of two voice signals: the lower the MCD value, the higher the similarity between the two voices. In the testing of VSUGAN, the original clean audio c a is used as the reference to calculate the MCD of the mixed-noise audio z a and of the generator output G(z a , x s ) . In the implementation, the pyworld, pysptk, and librosa libraries were used to compute the MCEP and MCD 26 . The MCD of the generator output is denoted MCD out and that of z a is denoted MCD za ; ΔMCD = MCD za − MCD out . ΔMCD is positively correlated with model performance.
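Both metrics can be computed directly from their definitions; a NumPy sketch, where the MCD constant 10·√2/ln 10 and the frame-wise averaging follow the common convention (which the text does not state explicitly):

```python
import numpy as np

def snr_db(clean, measured):
    """SNR of a measured signal against the clean reference c_a, in dB."""
    return 10 * np.log10(np.sum(clean ** 2) / np.sum((measured - clean) ** 2))

def mcd(mcep_ref, mcep_test):
    """Mean Mel-cepstral distortion between two (frames x coeffs) MCEP arrays."""
    diff = mcep_ref - mcep_test
    return (10.0 * np.sqrt(2.0) / np.log(10.0)) * np.mean(
        np.sqrt(np.sum(diff ** 2, axis=1)))

rng = np.random.default_rng(0)
c_a = rng.standard_normal(16384)                  # clean reference
z_a = c_a + 0.3 * rng.standard_normal(16384)      # noisy input
out = c_a + 0.05 * rng.standard_normal(16384)     # hypothetical generator output
delta_snr = snr_db(c_a, out) - snr_db(c_a, z_a)   # positive when noise is reduced
```

In the real evaluation the MCEP matrices come from pyworld/pysptk analysis of the two waveforms; here the signals are synthetic and serve only to show the sign convention of ΔSNR.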
To further evaluate the performance of VSUGAN, two additional indicators, PESQ 27 and STOI 28 , were chosen; the pypesq and pystoi libraries are used to compute them.
Training results. Figure 5 shows the effect of a piece of audio processed by the network. Figure 5a is the input audio, with its waveform (a1), spectrogram (a2), and a partial enlargement (a3). The audio is a sample from the testing set whose voiceprint details were destroyed on the spectrogram and which was then mixed with 30% Gaussian white noise. Figure 5b illustrates the audio output from the generator, where another clean audio of the same speaker is used as the style template for the input audio of Fig. 5a. The original audio of Fig. 5a is shown in Fig. 5c. Comparing (a2) and (b2) in Fig. 5, it can be seen that the background noise is largely filtered out by VSUGAN. Moreover, comparing (a3), (b3), and (c3) shows that the damaged voiceprint details in (a3) are recovered to a certain extent in (b3).
The statistics of ΔSNR and ΔMCD over the testing-set audio with different background noises are shown in Fig. 6. For Fig. 6a, b, the input is nine sample audios from the nine testing speakers, whose original voiceprint details are destroyed by the image morphology algorithm on the spectrogram and which are then mixed with noise, i.e., cafeteria noise, car noise, or white Gaussian noise. The mixed-noise intensity ranges from 0 to 99% of 0 dB in steps of 1%, and ΔSNR and ΔMCD are obtained for the nine input and output audios; the larger ΔSNR and ΔMCD, the more consistent the output style. From Fig. 6a, b, the values of ΔSNR and ΔMCD correlate significantly with the noise intensity. When the noise intensity exceeds 30%, ΔSNR is about 4-10 dB, and the value for Gaussian white noise is significantly greater than for the other two kinds of noise. In contrast, ΔMCD with Gaussian white noise is lower when the noise intensity exceeds 30%. Owing to the voiceprint repair performed by VSUGAN, the SNR and MCD of the output audio are better than those of the input audio even when the mixed-noise intensity is 0%.
In Fig. 6c, d, all original audios in the testing set, mixed with the three kinds of noise at fixed intensities after the voiceprint details were destroyed, are used as the network input. As shown in Fig. 6c, the ΔSNR of audio mixed with Gaussian white noise is significantly higher than that of audio mixed with the other two kinds of noise. However, the ΔMCD of audio mixed with Gaussian white noise is the lowest of the three kinds of noise (Fig. 6d). Thus, VSUGAN yields different degrees of improvement for different types of noise.
Along with the testing set of THCHS-30, the VCTK-Corpus 29 and the NoiseX-92 dataset were applied to validate the performance of VSUGAN. NoiseX-92, which contains 15 kinds of noise (including white noise, pink noise, and vehicle interior noise), is part of the Signal Processing Information Base 30 . In the performance test, ten audio samples from each of nine speakers were randomly sampled from the two datasets (90 audio samples per dataset). These audio samples were likewise mixed with noise of random intensity after the voiceprint details were destroyed. As shown in Table 3, the noise in group1 and group2 has the same source as the noise mixed in during training, while the noise in group3 was selected randomly from the NoiseX-92 dataset and was not seen during training. Group4 uses the same corpus as group1; to simulate a real acoustic environment, rir_generator 31 is used in group4 to generate reverberation without applying the image morphology algorithm. This mixed-noise audio was fed into VSUGAN, and the experimental results are shown in Table 3. Compared with the noisy input, the SNR and MCD indicators improve significantly in all four groups. The PESQ value improves in group1 and group2 but not in group3 and group4, and the STOI value differs little among group1-group3 while increasing slightly in group4. Thus, VSUGAN performs stably across different data sets. For reproducible research, our source code is available on GitHub (https://github.com/oy-tj/VSUGAN). The data set and trained model are shared on an online disk (https://pan.baidu.com/s/1RwvpwZjSET7hrLvpfvNizA password: 4y8a).

Conclusion
In the present study, we proposed and implemented a model based on GAN that combines the style information extracted from the style template audio with the voice information extracted from the processed audio. Without retraining for new speakers, VSUGAN generates audio in the same style as the template. VSUGAN was trained on the THCHS-30 corpus and tested on two open-source corpora. The experimental results demonstrate that VSUGAN can effectively improve the quality of recorded audio and reduce style differences across various environments.