Ji-Sang Hwang, Sang-Hoon Lee, and Seong-Whan Lee
Recently, denoising diffusion models have achieved remarkable performance among generative models in various domains. However, in the speech domain, applying diffusion models to synthesize time-varying audio is limited in terms of complexity and controllability, because speech synthesis requires very high-dimensional samples with long-term acoustic features. To reduce the burden of model complexity for singing voice synthesis, we propose HiddenSinger, which synthesizes high-quality singing voices using a neural audio codec and latent diffusion models. To generate high-fidelity audio, we introduce an audio autoencoder that encodes audio into a compressed representation via a neural audio codec and reconstructs high-fidelity audio from the low-dimensional compressed latent vector. Subsequently, we use a latent diffusion model to sample a latent representation conditioned on a musical score. In addition, our proposed model is extended to an unsupervised singing voice learning framework, HiddenSinger-U, to train the model with an unlabeled singing voice dataset. The experimental results demonstrate that our model outperforms previous models in terms of audio quality. Furthermore, HiddenSinger-U can synthesize high-quality singing voices of speakers for whom only unlabeled data was available during training.
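To make the two-stage design described above concrete, the following is a minimal PyTorch sketch of the inference pipeline: an audio autoencoder that maps between waveforms and low-dimensional latents, and a denoiser that samples a latent conditioned on a musical-score embedding. All module names, dimensions, and the simplified reverse-diffusion loop are illustrative assumptions for exposition, not the architecture or training procedure used in HiddenSinger.

```python
# Illustrative sketch only; module names, dimensions, and the toy denoiser
# are assumptions, not the authors' implementation.
import torch
import torch.nn as nn


class AudioAutoencoder(nn.Module):
    """Compresses waveforms into low-dimensional latents and reconstructs audio."""

    def __init__(self, latent_dim: int = 64, stride: int = 256):
        super().__init__()
        self.encoder = nn.Conv1d(1, latent_dim, kernel_size=stride * 2,
                                 stride=stride, padding=stride // 2)
        self.decoder = nn.ConvTranspose1d(latent_dim, 1, kernel_size=stride * 2,
                                          stride=stride, padding=stride // 2)

    def encode(self, wav: torch.Tensor) -> torch.Tensor:
        return self.encoder(wav)   # (B, latent_dim, T / stride)

    def decode(self, z: torch.Tensor) -> torch.Tensor:
        return self.decoder(z)     # (B, 1, ~T)


class LatentDenoiser(nn.Module):
    """Predicts the noise in a latent, conditioned on a musical-score embedding."""

    def __init__(self, latent_dim: int = 64, cond_dim: int = 64):
        super().__init__()
        self.net = nn.Conv1d(latent_dim + cond_dim, latent_dim,
                             kernel_size=3, padding=1)

    def forward(self, z_t: torch.Tensor, cond: torch.Tensor,
                t: torch.Tensor) -> torch.Tensor:
        # A real denoiser would also embed the timestep t; omitted for brevity.
        return self.net(torch.cat([z_t, cond], dim=1))


@torch.no_grad()
def sample_latent(denoiser: LatentDenoiser, cond: torch.Tensor,
                  latent_dim: int = 64, steps: int = 50) -> torch.Tensor:
    """Highly simplified reverse diffusion: start from noise, iteratively denoise."""
    z = torch.randn(cond.shape[0], latent_dim, cond.shape[-1])
    for step in reversed(range(steps)):
        t = torch.full((cond.shape[0],), step)
        z = z - denoiser(z, cond, t) / steps
    return z


if __name__ == "__main__":
    autoencoder, denoiser = AudioAutoencoder(), LatentDenoiser()
    score_embedding = torch.randn(1, 64, 100)     # stand-in for an encoded musical score
    z = sample_latent(denoiser, score_embedding)  # stage 2: score -> latent
    wav = autoencoder.decode(z)                   # stage 1 decoder: latent -> waveform
    print(wav.shape)
```

The key design point reflected here is that diffusion operates on the compressed latent rather than on the raw waveform, which keeps the sampled representation low-dimensional while the autoencoder handles high-fidelity reconstruction.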