[Zero-Shot TTS, Speech Editing] RetrieverTTS: Modeling Decomposed Factors for Text-Based Speech Insertion

Abstract

This paper proposes a new "decompose-and-edit" paradigm for the text-based speech insertion task that facilitates arbitrary-length speech insertion and even full sentence generation. In the proposed paradigm, global and local factors in speech are explicitly decomposed and separately manipulated to achieve high speaker similarity and continuous prosody. Specifically, we propose to represent the global factors by multiple tokens, which are extracted by a cross-attention operation and then injected back by a link-attention operation. Thanks to this rich representation of global factors, we achieve high speaker similarity in a zero-shot manner. In addition, we introduce a prosody smoothing task to make the local prosody factor context-aware and thereby achieve satisfactory prosody continuity. We further achieve high voice quality with an adversarial training stage. In the subjective test, our method achieves state-of-the-art performance in both naturalness and similarity.

Method

The proposed framework, RetrieverTTS, is composed of four parts: a phoneme encoder, a variance adaptor, a mel decoder, and a global factor encoder. The extracted global factors (style, timbre) are represented by m tokens. For style injection, the first token is added to the input of the phoneme encoder. For timbre injection, all the tokens are injected into the mel decoder with a link-attention mechanism.
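To make the token extraction step more concrete, here is a minimal NumPy sketch of how a fixed number of global tokens could be pulled out of a variable-length reference utterance with cross-attention. It is an illustration under assumptions, not the authors' implementation: the single-head form, the learnable `queries`, the projections `w_k` and `w_v`, and all dimensions are hypothetical. The link-attention injection into the mel decoder is not sketched, since this page does not describe its details.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def extract_global_tokens(frames, queries, w_k, w_v):
    """Single-head cross-attention: m query vectors attend over T frames.

    frames:   (T, d) reference speech features
    queries:  (m, d) learnable query vectors (hypothetical parameter)
    w_k, w_v: (d, d) key/value projections (hypothetical parameters)
    Returns (m, d) global tokens and the (m, T) attention weights.
    """
    keys = frames @ w_k
    values = frames @ w_v
    scores = queries @ keys.T / np.sqrt(keys.shape[-1])  # (m, T)
    weights = softmax(scores, axis=-1)                   # each row sums to 1
    return weights @ values, weights

rng = np.random.default_rng(0)
d, m = 16, 4  # assumed feature size and token count
queries = rng.normal(size=(m, d))
w_k = rng.normal(size=(d, d)) / np.sqrt(d)
w_v = rng.normal(size=(d, d)) / np.sqrt(d)

# Two references of different lengths yield the same number of tokens.
short_ref = rng.normal(size=(50, d))
long_ref = rng.normal(size=(300, d))
tokens_short, attn_short = extract_global_tokens(short_ref, queries, w_k, w_v)
tokens_long, attn_long = extract_global_tokens(long_ref, queries, w_k, w_v)

# The first token is the one the text says is added to the phoneme encoder input.
style_token = tokens_short[0]
```

The point of the sketch is that the output size depends only on m, not on the reference length, which is what lets a reference of any duration be summarized by the same set of tokens.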

We add a prosody smoothing task during training. For 50% of the training samples, we add partially masked ground-truth prosody embeddings to the encoded phonemes before feeding them into the variance adaptor. The masked embeddings are filled with zeros. An additional adversarial training stage is introduced to improve speech quality.
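The prosody smoothing input described above can be sketched as follows. This is a hedged illustration only: the function name, the contiguous mask span, the `mask_ratio` value, and the tensor shapes are assumptions, since the page specifies only the 50% application rate, the partial masking, and the zero fill.

```python
import numpy as np

def prosody_smoothing_input(encoded, prosody, rng, apply_prob=0.5, mask_ratio=0.3):
    """Build the variance adaptor input for one training sample.

    encoded: (N, d) phoneme encoder outputs
    prosody: (N, d) ground-truth prosody embeddings
    With probability apply_prob, a contiguous span of prosody embeddings
    (an assumed masking pattern) is zeroed and the rest is added to `encoded`.
    Returns the input and a boolean keep-mask (None if the task is skipped).
    """
    n = encoded.shape[0]
    if rng.random() >= apply_prob:
        # Task not applied for this sample: phonemes pass through unchanged.
        return encoded.copy(), None
    span = max(1, int(round(mask_ratio * n)))
    start = rng.integers(0, n - span + 1)
    keep = np.ones(n, dtype=bool)
    keep[start:start + span] = False
    # Masked positions are filled with zeros, so they contribute nothing.
    masked_prosody = np.where(keep[:, None], prosody, 0.0)
    return encoded + masked_prosody, keep

rng = np.random.default_rng(0)
encoded = np.ones((10, 4))
prosody = np.full((10, 4), 2.0)
model_input, keep = prosody_smoothing_input(encoded, prosody, rng)
```

At the masked positions the model sees only the phoneme encoding, so it has to predict prosody there from the surrounding context, which is presumably what makes the local prosody factor context-aware.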

Audio Samples

Testing Condition

All the samples on this demo page are spoken by speakers who do not appear in the training data.
Our model is not fine-tuned on these unseen speakers' voices.

Long Insertion Examples

Example 1

[Original Sentence]: They are chiefly formed from combinations of the impressions made in childhood.
[Edited Sentence]: They are chiefly formed from combinations under his success of the impressions made in childhood.

Example 2

[Original Sentence]: Is the under side of civilization any less important than the upper side merely because it is deeper and more sombre?
[Edited Sentence 1]: Is the under side of civilization any less important than the upper side of this civilization merely because it is deeper and more sombre?
[Edited Sentence 2]: Is the under side of civilization any less important than the upper side of the very same civilization merely because it is deeper and more sombre?

Full Sentence Generation Examples (Zero-shot TTS)

Example 1

[Reference Sentence]: Mr. T. himself had been much abroad, both on business and to see the great continental galleries of paintings.
[Generated Sentence]: He expended quite as much in my education as he could afford in justice to the rest.
[Ground-Truth Recording]: He expended quite as much in my education as he could afford in justice to the rest.

Example 2

[Reference Sentence]: Sheriff Jones made several visits unmolested on their part, and without any display of writs or demand for the surrender of alleged offenders on his own.
[Generated Sentence]: Meanwhile a sober second thought had come to Governor Shannon.
[Ground-Truth Recording]: Meanwhile a sober second thought had come to Governor Shannon.

System Comparison

Similarity Test

[Text to be generated]: He spoke French perfectly, I have been told, when need was; but delighted usually in talking the broadest Yorkshire.

Reference Voice	RetrieverTTS	Meta-StyleSpeech

[Text to be generated]: Gradually I knew I was mastering him--then all was blank.

Reference Voice	RetrieverTTS	Meta-StyleSpeech

Naturalness Test

[Text after insertion]: Recovering his recollection on the instant, instead of sounding an alarm, which might prove fatal to himself, he remained stationary, an attentive observer of the other's motions.

Ground Truth Recording	RetrieverTTS	Meta-StyleSpeech	C.Tang et al.

[Text after insertion]: Miss Taylor was soon starving for human companionship, for the lighter touches of life and some of its warmth and laughter.

Ground Truth Recording	RetrieverTTS	Meta-StyleSpeech	C.Tang et al.

Ablation Study

[Text after insertion]: He was going to leave the room--she followed him, and cried, "But, my Lord, how shall I see again the unhappy object of my treachery?"

Ground Truth Recording	RetrieverTTS	-adv	-prosody-smooth	-retriever

BibTeX

@inproceedings{yin2022retrievertts,
  title={RetrieverTTS: Modeling Decomposed Factors for Text-Based Speech Insertion},
  author={Yin, Dacheng and Tang, Chuanxin and Liu, Yanqing and Wang, Xiaoqiang and Zhao, Zhiyuan and Zhao, Yucheng and Xiong, Zhiwei and Zhao, Sheng and Luo, Chong},
  booktitle={Interspeech},
  year={2022}
}
