PauseSpeech: Natural Speech Synthesis via Pre-trained Language Model and Pause-based Prosody Modeling

 

Ji-Sang Hwang, Sang-Hoon Lee, and Seong-Whan Lee

Abstract

Although text-to-speech (TTS) systems have improved significantly, most still struggle to synthesize speech with appropriate phrasing. Natural speech synthesis requires a phrasing structure that groups words into phrases based on semantic information. In this paper, we propose PauseSpeech, a speech synthesis system with a pre-trained language model and pause-based prosody modeling. First, we introduce a phrasing structure encoder that utilizes a context representation from the pre-trained language model. In the phrasing structure encoder, we extract a speaker-dependent syntactic representation from the context representation and then predict a pause sequence that separates the input text into phrases. Furthermore, we introduce a pause-based word encoder that models word-level prosody based on the pause sequence. Experimental results show that PauseSpeech outperforms previous models in terms of naturalness. In objective evaluations, the proposed methods also reduce the distance between ground-truth and synthesized speech.
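To make the idea of a pause sequence concrete, the following is a minimal sketch of how a per-word pause sequence could separate an input text into phrases. It is an illustration only, not the PauseSpeech implementation: the binary label format (1 means a pause follows the word), the function name `split_into_phrases`, and the example labels are all assumptions made for this sketch.

```python
from typing import List


def split_into_phrases(words: List[str], pauses: List[int]) -> List[List[str]]:
    """Group words into phrases, closing a phrase after each word flagged with a pause.

    Assumption (hypothetical format): pauses[i] is 1 if a pause follows words[i], else 0.
    """
    if len(words) != len(pauses):
        raise ValueError("words and pauses must have the same length")

    phrases: List[List[str]] = []
    current: List[str] = []
    for word, pause in zip(words, pauses):
        current.append(word)
        if pause:
            # A pause ends the current phrase.
            phrases.append(current)
            current = []
    if current:
        # Keep trailing words that are not followed by a pause.
        phrases.append(current)
    return phrases


# Hand-written example labels (not model output): one pause after "peas".
words = "Six spoons of fresh snow peas five thick slabs of blue cheese".split()
pauses = [0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
phrases = split_into_phrases(words, pauses)
for phrase in phrases:
    print(" ".join(phrase))
```

In this sketch the pause labels are supplied by hand; in a system like the one described above, they would instead come from a learned predictor.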


PauseSpeech


Comparative Model


Sentence 1 (p299):

The difference in the rainbow depends considerably upon the size of the drops, and the width of the colored band increases as the size of the drops increases.

GT

ASR: The difference in the rainbow depends considerably upon the size of the drops and the width of the color band increases as the size of the drops increase.

HiFi-GAN (Recon.)

ASR: The difference in the rainbow depends considerably upon the size of the drops and the width of the color band increases as the size of the drops increase.

FastSpeech 2

ASR: The difference in the rainbow depends considerably upon the size of the drops and the width of the colored band increases as the size of the drops increases.

PortaSpeech

ASR: The difference in the rainbow depends considerably upon the size of the drops and the width of the colored band increases as the size of the drops increases.

PauseSpeech

ASR: The difference in the rainbow depends considerably upon the size of the drops and the width of the colored band increases as the size of the drops increases.

Sentence 2 (p238):

Six spoons of fresh snow peas, five thick slabs of blue cheese, and maybe a snack for her brother Bob.

GT

ASR: Six spoons of fresh snow peas five thick slabs of blue cheese and maybe a snack for her brother Bob.

HiFi-GAN (Recon.)

ASR: Six spoons of fresh snow peas five thick slabs of blue cheese and maybe a snack for her brother Bob.

FastSpeech 2

ASR: Six spoons of fresh snow peas five thick slabs of blue cheese and maybe a snuck for her brother Bob.

PortaSpeech

ASR: Six spoons of fresh sano peas five thick slabs of blue cheese and maybe a snack for her brother Bub.

PauseSpeech

ASR: Six spoons of fresh snow peas five thick slabs of blue cheese and maybe a snack for her brother Bob.

Sentence 3 (p335):

JOHN Anderson, the former Scottish national coach, is still unearthing talent.

GT

ASR: JOHN Anderson the former Scottish national coach is still unearthing talent.

HiFi-GAN (Recon.)

ASR: JOHN Anderson the former Scottish national coach is still unearthing talent.

FastSpeech 2

ASR: JOHN Enderson the former Scottish national coach is still unearthing talent.

PortaSpeech

ASR: JOHN Anderson the former Scottish national coach is still anearthing talent.

PauseSpeech

ASR: JOHN Anderson the former Scottish national coach is still on erathing talent.

Sentence 4 (p360):

Scotland, is an increasing concern for young people.

GT

ASR: Scotland is a increasing concern for young people.

HiFi-GAN (Recon.)

ASR: Scotland is an increasing concern for young people.

FastSpeech 2

ASR: Scotland is an increasing concern for young people.

PortaSpeech

ASR: Scotland is an increasing concern for young people.

PauseSpeech

ASR: Scotland is an increasing concern for young people.

Sentence 5 (p360):

But the main issue will be the sale of Burger King.

GT

ASR: But the main issue will be the sale of Berger king.

HiFi-GAN (Recon.)

ASR: But the main issue will be the sale of Berger king.

FastSpeech 2

ASR: But the maiden she will be the sail of Berger king.

PortaSpeech

ASR: But the main issue will be the sail of Bergo king.

PauseSpeech

ASR: But the main issue will be the sale of Berger king.
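The ASR transcripts above can be compared against the input text with a word error rate (WER). The following is a minimal sketch under stated assumptions: the normalization (lowercasing and stripping punctuation) and the helper names `normalize`, `word_edit_distance`, and `word_error_rate` are choices made for this illustration and are not necessarily the scoring procedure used for the reported results.

```python
import re
from typing import List


def normalize(text: str) -> List[str]:
    """Lowercase, drop punctuation, and split into words (assumed normalization)."""
    return re.sub(r"[^\w\s]", "", text.lower()).split()


def word_edit_distance(ref: List[str], hyp: List[str]) -> int:
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word edit distance divided by the number of reference words."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one word")
    return word_edit_distance(ref, hyp) / len(ref)


# Sentence 5 from the samples above.
reference = "But the main issue will be the sale of Burger King."
fastspeech2_asr = "But the maiden she will be the sail of Berger king."
pausespeech_asr = "But the main issue will be the sale of Berger king."

wer_fastspeech2 = word_error_rate(reference, fastspeech2_asr)
wer_pausespeech = word_error_rate(reference, pausespeech_asr)
print(f"FastSpeech 2: {wer_fastspeech2:.3f}")
print(f"PauseSpeech:  {wer_pausespeech:.3f}")
```

Note that a spelling variant such as "Berger" for "Burger" counts as an error under this simple normalization, even in the ground-truth recording, so scores on a single sentence should be read with care.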

Analysis of Self-supervised Representation


Ablation Study