Generative Adversarial Network for speech synthesis

Seham Nasr
Feb 13, 2021 · 3 min read
(Header image taken from the Speech Signal Processing Laboratory.)

Generative adversarial networks (GANs) have achieved impressive quality in image generation for computer vision. Recently, the speech and NLP communities have shown intense interest in using GANs for speech synthesis. Applying GANs to text-to-speech (TTS) is more challenging because of the nature of text and speech data, but at the same time it can improve the speech synthesis task and overcome some problems of conventional approaches.

A GAN consists of two separate neural networks: a generator and a discriminator. The generator takes a random variable z drawn from a prior distribution Pz(z) and attempts to map it to the data distribution Px(x). During training, the output distribution of the generator is expected to converge to the data distribution. The discriminator, on the other hand, is expected to distinguish real samples from generated ones, outputting ones for real samples and zeros for generated ones. During training, the generator produces samples and the discriminator classifies them, each adversarially affecting the performance of the other. This is a two-player minimax game.
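To make the minimax game concrete, here is a minimal sketch of one GAN training step in PyTorch. It uses a toy fully connected generator and discriminator on fixed-size vectors; the layer sizes, learning rates, and Gaussian "real" data are illustrative assumptions only, not the setup of any of the models cited below.

```python
import torch
import torch.nn as nn

# Toy generator G: latent vector z -> fake sample; discriminator D: sample -> realness logit.
latent_dim, data_dim = 64, 128
G = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(), nn.Linear(256, data_dim))
D = nn.Sequential(nn.Linear(data_dim, 256), nn.ReLU(), nn.Linear(256, 1))

opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

def train_step(real_batch):
    batch_size = real_batch.size(0)
    z = torch.randn(batch_size, latent_dim)   # z ~ Pz(z)
    fake_batch = G(z)

    # Discriminator step: push D(real) toward 1 and D(fake) toward 0.
    d_loss = bce(D(real_batch), torch.ones(batch_size, 1)) + \
             bce(D(fake_batch.detach()), torch.zeros(batch_size, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator step: push D(fake) toward 1, i.e. try to fool the discriminator.
    g_loss = bce(D(fake_batch), torch.ones(batch_size, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()

# Toy usage: "real" data drawn from a fixed Gaussian.
real = torch.randn(32, data_dim)
print(train_step(real))
```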

Speech synthesis is the process of converting text into a human-like voice. Two conventional approaches have traditionally been used for the TTS task: concatenative TTS and parametric TTS [1]. Deep-learning-based methods map linguistic features to acoustic features with deep neural networks and have proven to be an effective way to learn features from data [1]. Several DL models have been applied to speech synthesis, such as the deep belief network (DBN) [2], the deep mixture density network (DMDN) [3], deep bidirectional long short-term memory (DBLSTM) networks [4, 5], WaveNet [6], Tacotron, and convolutional neural networks (CNNs) [7, 8]. With today's massively parallel computing and big data, GAN-based speech synthesis has recently been adopted to obtain more efficient, parallelizable models. GAN-TTS, introduced in [9], uses a conditional feed-forward generator that produces raw audio, while the discriminator operates on random windows of different sizes (an ensemble of random window discriminators) and assesses how well the generated audio matches the desired utterance. The model conditions on linguistic and pitch features during TTS [9]; a rough sketch of the random-window idea is given below.
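As an illustration of the ensemble-of-random-window-discriminators idea, the sketch below crops random windows of several sizes from a batch of raw waveforms, one crop per discriminator. The window sizes and sampling scheme are assumptions made for illustration, not the exact values used in GAN-TTS [9].

```python
import torch

def sample_random_windows(waveform, window_sizes, windows_per_size=1):
    """Crop random windows of several sizes from a batch of raw waveforms.

    Each discriminator in the ensemble sees a differently sized random crop,
    so the ensemble judges the audio at several time scales.
    """
    batch, num_samples = waveform.shape
    crops = []
    for size in window_sizes:
        for _ in range(windows_per_size):
            start = torch.randint(0, num_samples - size + 1, (1,)).item()
            crops.append(waveform[:, start:start + size])
    return crops

# Toy usage: a batch of 2 one-second "waveforms" at 24 kHz.
audio = torch.randn(2, 24000)
windows = sample_random_windows(audio, window_sizes=[240, 480, 960, 1920, 3600])
print([w.shape for w in windows])
```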

Since we chose GAN-TTS for our work, we need to prepare our dataset as described in [9] so that it fits the model and achieves our goal. The input to the generator is a sequence of linguistic and pitch features at 200 Hz, and its output is the raw waveform at 24 kHz. Training GAN-TTS requires high-performance GPUs, and increasing the number of discriminators requires even more GPU resources [9].
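One practical consequence of these rates is the upsampling factor the generator must realize: 24,000 / 200 = 120 audio samples per feature frame. The small helper below (an assumption-level sketch, not part of the GAN-TTS code) shows this frame-to-sample bookkeeping, which is also useful when slicing the dataset into training windows.

```python
# Conditioning features arrive at 200 Hz, audio leaves at 24 kHz,
# so the generator upsamples by 24000 / 200 = 120 samples per frame.
FEATURE_RATE_HZ = 200
AUDIO_RATE_HZ = 24_000
SAMPLES_PER_FRAME = AUDIO_RATE_HZ // FEATURE_RATE_HZ  # 120

def expected_waveform_length(num_feature_frames: int) -> int:
    """Number of audio samples the generator should emit for a feature sequence."""
    return num_feature_frames * SAMPLES_PER_FRAME

# Example: a 2-second utterance has 400 feature frames -> 48,000 samples.
print(expected_waveform_length(400))
```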

References:

  1. Ning, Yishuang, et al. “A review of deep learning based speech synthesis.” Applied Sciences 9.19 (2019): 4050.
  2. Kang, Shiyin, Xiaojun Qian, and Helen Meng. “Multi-distribution deep belief network for speech synthesis.” 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2013.
  3. Zen, Heiga, and Andrew Senior. “Deep mixture density networks for acoustic modeling in statistical parametric speech synthesis.” 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2014.
  4. Li, Xinxing, et al. “A deep bidirectional long short-term memory based multi-scale approach for music dynamic emotion prediction.” 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2016.
  5. Mousa, Amr El-Desoky, and Björn W. Schuller. “Deep Bidirectional Long Short-Term Memory Recurrent Neural Networks for Grapheme-to-Phoneme Conversion Utilizing Complex Many-to-Many Alignments.” Interspeech. 2016.
  6. Oord, Aaron, et al. “Parallel WaveNet: Fast high-fidelity speech synthesis.” International Conference on Machine Learning. 2018.
  7. Skerry-Ryan, R. J., et al. “Towards end-to-end prosody transfer for expressive speech synthesis with Tacotron.” arXiv preprint arXiv:1803.09047 (2018).
  8. Tachibana, Hideyuki, Katsuya Uenoyama, and Shunsuke Aihara. “Efficiently trainable text-to-speech system based on deep convolutional networks with guided attention.” 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018.
  9. Bińkowski, Mikołaj, et al. “High fidelity speech synthesis with adversarial networks.” arXiv preprint arXiv:1909.11646 (2019).
