Ji-Sang Hwang, Sang-Hoon Lee, and Seong-Whan Lee
Recently, denoising diffusion models have achieved remarkable performance among generative models in various domains. However, in the speech domain, applying diffusion models to synthesize time-varying audio is limited in terms of complexity and controllability, because speech synthesis requires very high-dimensional samples with long-term acoustic features. To reduce model complexity for singing voice synthesis, we propose HiddenSinger, which synthesizes high-quality singing voices using a neural audio codec and latent diffusion models. To generate high-fidelity audio, we introduce an audio autoencoder that encodes audio into a neural audio codec as a compressed representation and reconstructs high-fidelity audio from the low-dimensional compressed latent vector. Subsequently, we use latent diffusion models to sample a latent representation from a musical score. In addition, we extend the proposed model into an unsupervised singing voice learning framework, HiddenSinger-U, which can be trained on an unlabeled singing voice dataset. Experimental results demonstrate that our model outperforms previous models in terms of audio quality. Furthermore, HiddenSinger-U can synthesize high-quality singing voices of speakers who appear only in the unlabeled data during training.

| Script: 서로를 향해 다가갔지 (Pronunciation: seolol hyanghae dagagassji) | | | |
|---|---|---|---|
| GT | Audio Autoencoder (Recon.) | HiddenSinger | HiddenSinger-U |


| Script: 아 그댈 바라만 보는게 힘들어서 (Pronunciation: a geudael balaman boneunge himdeul-eoseo) | | | |
|---|---|---|---|
| GT | Audio Autoencoder (Recon.) | HiddenSinger | HiddenSinger-U |


| Script: 다 같이 모두 모여 (Pronunciation: da gat-i modu moyeo) | | | |
|---|---|---|---|
| GT | Audio Autoencoder (Recon.) | HiddenSinger | HiddenSinger-U |


| Script: 쉬게 해줘요 그대 사랑이 (Pronunciation: swige haejwoyo geudae salang-i) | | | |
|---|---|---|---|
| GT | HiFi-GAN (Recon.) | | |
| FastSpeech 2 | DiffSinger | VISinger | HiddenSinger |
| HiddenSinger-U | | | |


| Script: 멋지지 않나요 뭔가를 사랑했던 (Pronunciation: meosjiji anhnayo mwongal salanghaessdeon) | | | |
|---|---|---|---|
| GT | HiFi-GAN (Recon.) | | |
| FastSpeech 2 | DiffSinger | VISinger | HiddenSinger |
| HiddenSinger-U | | | |


| Script: 나름대로 빛났을테니 (Pronunciation: naleumdaelo bichnass-eulteni) | | | |
|---|---|---|---|
| GT | HiFi-GAN (Recon.) | | |
| FastSpeech 2 | DiffSinger | VISinger | HiddenSinger |
| HiddenSinger-U | | | |


| Script: 쓸쓸하게 비추는 거야 (Pronunciation: sseulsseulhage bichuneun geoya) | | | |
|---|---|---|---|
| GT | HiFi-GAN (Recon.) | | |
| FastSpeech 2 | DiffSinger | VISinger | HiddenSinger |
| HiddenSinger-U | | | |


| Script: 너에게 난 빠져있는거야 (Pronunciation: neoege nan ppajyeoissneungeoya) | | | |
|---|---|---|---|
| GT | HiFi-GAN | | |
| VISinger (recon.) | Autoencoder w/o reg. | Autoencoder w/ KL-reg. | Autoencoder w/ RVQ-reg. |


| Script: 모두 다 잊혀 (Pronunciation: modu da ijhyeo) | | | |
|---|---|---|---|
| GT | HiFi-GAN | | |
| VISinger (recon.) | Autoencoder w/o reg. | Autoencoder w/ KL-reg. | Autoencoder w/ RVQ-reg. |


| Script: 여기저기 뛰어도 보고 (Pronunciation: yeogijeogi ttwieodo bogo) | | | |
|---|---|---|---|
| GT | HiFi-GAN | | |
| VISinger (recon.) | Autoencoder w/o reg. | Autoencoder w/ KL-reg. | Autoencoder w/ RVQ-reg. |


| Script: 해준 것만 생각나 (Pronunciation: haejun geonman saeng-gagna) | | |
|---|---|---|
| GT | | |
| Latent Generator w/o reg. | Latent Generator w/ KL-reg. | Latent Generator w/ RVQ-reg. (proposed) |


| Script: 우리의 스토리 love me love (Pronunciation: uliui story love me love) | | |
|---|---|---|
| GT | | |
| Latent Generator w/o reg. | Latent Generator w/ KL-reg. | Latent Generator w/ RVQ-reg. (proposed) |


| Script: 오늘은 무얼하고 놀까요 (Pronunciation: oneul-eun mueolhago nolkkayo) | | |
|---|---|---|
| GT | | |
| Latent Generator w/o reg. | Latent Generator w/ KL-reg. | Latent Generator w/ RVQ-reg. (proposed) |


| Script: 왜 이리 찬란해 왜 또 나는 너를 (Pronunciation: wae ili chanlanhae wae tto naneun neoleul) | | | |
|---|---|---|---|
| GT | Unlabeled Ratio: 0% | Unlabeled Ratio: 2% | Unlabeled Ratio: 5% |
| Unlabeled Ratio: 10% | Unlabeled Ratio: 20% | Unlabeled Ratio: 50% | |


| Script: 내 머리속엔 너 밖에 없어 (Pronunciation: nae meolisog-en neo bakk-e eobs-eo) | | | |
|---|---|---|---|
| GT | Unlabeled Ratio: 0% | Unlabeled Ratio: 2% | Unlabeled Ratio: 5% |
| Unlabeled Ratio: 10% | Unlabeled Ratio: 20% | Unlabeled Ratio: 50% | |


| Script: 멋진 어린이가 될래 (Pronunciation: meotjin eolin-iga doellae) | | | |
|---|---|---|---|
| GT | Unlabeled Ratio: 0% | Unlabeled Ratio: 2% | Unlabeled Ratio: 5% |
| Unlabeled Ratio: 10% | Unlabeled Ratio: 20% | Unlabeled Ratio: 50% | |


| Script: 난 다시 나는 널 못 잊어 (Pronunciation: nan dasi naneun neol mos ij-eo) | | | |
|---|---|---|---|
| GT | | | |
| HiddenSinger | w/o Enhanced Prior Encoder | w/ z_q Generation | w/ Standard Gaussian Prior |


| Script: 그대 날 떠나면 (Pronunciation: geudae nal tteonamyeon) | | | |
|---|---|---|---|
| GT | | | |
| HiddenSinger | w/o Enhanced Prior Encoder | w/ z_q Generation | w/ Standard Gaussian Prior |


| Script: 볼 수 있었던 너의 표정 (Pronunciation: bol su iss-eossdeon neoui pyojeong) | | | |
|---|---|---|---|
| GT | | | |
| HiddenSinger | w/o Enhanced Prior Encoder | w/ z_q Generation | w/ Standard Gaussian Prior |
