RetrieverTTS: Modeling Decomposed Factors for Text-Based Speech Insertion

1University of Science and Technology of China, 2Microsoft Research Asia, 3Microsoft Azure Speech
Interspeech 2022

Abstract

This paper proposes a new "decompose-and-edit" paradigm for the text-based speech insertion task that facilitates arbitrary-length speech insertion and even full sentence generation. In the proposed paradigm, global and local factors in speech are explicitly decomposed and separately manipulated to achieve high speaker similarity and continuous prosody. Specifically, we propose to represent the global factors by multiple tokens, which are extracted by a cross-attention operation and then injected back by a link-attention operation. Thanks to this rich representation of global factors, we achieve high speaker similarity in a zero-shot manner. In addition, we introduce a prosody smoothing task to make the local prosody factor context-aware and thereby achieve satisfactory prosody continuity. We further achieve high voice quality with an adversarial training stage. In the subjective test, our method achieves state-of-the-art performance in both naturalness and similarity.

Method

The proposed framework, "RetrieverTTS", is composed of four parts: a phoneme encoder, a variance adaptor, a mel decoder, and a global factor encoder. The extracted global factors (style, timbre) are represented by m tokens. For style injection, the first token is added to the input of the phoneme encoder. For timbre injection, all m tokens are injected into the mel decoder through a link-attention mechanism.
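
As a rough illustration of the extraction step only, the sketch below computes m global tokens by letting m query vectors attend over frame-level features with standard scaled dot-product attention. The query vectors, the sizes, the random inputs, and the absence of learned projections are all assumptions made for illustration. The link-attention injection is not shown, and this is not the authors' implementation.

```python
import math
import random

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, frames):
    """Each query attends over all frames; frames act as keys and values."""
    d = len(queries[0])
    tokens = []
    for q in queries:
        # Scaled dot-product score between this query and every frame.
        scores = [sum(qi * fi for qi, fi in zip(q, f)) / math.sqrt(d)
                  for f in frames]
        weights = softmax(scores)
        # The token is the attention-weighted average of the frames.
        token = [sum(w * f[j] for w, f in zip(weights, frames))
                 for j in range(d)]
        tokens.append(token)
    return tokens

random.seed(0)
m, T, d = 4, 50, 8  # hypothetical: token count, frame count, feature size
queries = [[random.gauss(0, 1) for _ in range(d)] for _ in range(m)]
frames = [[random.gauss(0, 1) for _ in range(d)] for _ in range(T)]

global_tokens = cross_attention(queries, frames)
style_token = global_tokens[0]  # first token, used for style injection
print(len(global_tokens), len(global_tokens[0]))  # 4 8
```

Because the number of output tokens depends only on the number of queries, a reference utterance of any length maps to the same fixed-size set of m tokens in this sketch.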

We add a prosody smoothing task during training. In 50% of the training samples, we add partially masked ground-truth prosody embeddings to the encoded phonemes before feeding them into the variance adaptor. The masked embeddings are filled with zeros. An additional adversarial training stage is introduced to improve speech quality.
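
The masking step of the prosody smoothing task can be sketched as follows. The text above states only the 50% rate and the zero filling; the choice of a single contiguous masked span, the function names, and the toy values are assumptions made for illustration, not details taken from the paper.

```python
import random

def mask_prosody(prosody, start, end):
    """Return a copy of prosody with positions [start, end) set to zeros."""
    return [[0.0] * len(p) if start <= i < end else list(p)
            for i, p in enumerate(prosody)]

def add_prosody_hint(encoded, prosody, rng, p_apply=0.5):
    """With probability p_apply, add partially masked prosody to the phonemes."""
    if rng.random() >= p_apply:
        # Remaining samples: the encoded phonemes pass through unchanged.
        return [list(h) for h in encoded]
    n = len(prosody)
    start = rng.randrange(n)               # assumed: one contiguous span
    end = rng.randrange(start + 1, n + 1)  # span covers at least one phoneme
    masked = mask_prosody(prosody, start, end)
    return [[h + q for h, q in zip(hv, qv)]
            for hv, qv in zip(encoded, masked)]

rng = random.Random(0)
encoded = [[1.0, 1.0] for _ in range(6)]   # toy encoded phonemes
prosody = [[0.5, -0.5] for _ in range(6)]  # toy ground-truth prosody
out = add_prosody_hint(encoded, prosody, rng, p_apply=1.0)
```

In this sketch, masked positions carry no prosody information, so a model trained this way would have to predict their prosody from the unmasked context, which matches the stated goal of context-aware local prosody.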

Audio Samples

Testing Condition

Long Insertion Examples

Example 1
[Original Sentence]: They are chiefly formed from combinations of the impressions made in childhood.
[Edited Sentence]: They are chiefly formed from combinations under his success of the impressions made in childhood.
Example 2
[Original Sentence]: Is the under side of civilization any less important than the upper side merely because it is deeper and more sombre?
[Edited Sentence 1]: Is the under side of civilization any less important than the upper side of this civilization merely because it is deeper and more sombre?
[Edited Sentence 2]: Is the under side of civilization any less important than the upper side of the very same civilization merely because it is deeper and more sombre?

Full Sentence Generation Examples (Zero-shot TTS)

Example 1
[Reference Sentence]: Mr. T. himself had been much abroad, both on business and to see the great continental galleries of paintings.
[Generated Sentence]: He expended quite as much in my education as he could afford in justice to the rest.
[Ground-Truth Recording]: He expended quite as much in my education as he could afford in justice to the rest.
Example 2
[Reference Sentence]: Sheriff Jones made several visits unmolested on their part, and without any display of writs or demand for the surrender of alleged offenders on his own.
[Generated Sentence]: Meanwhile a sober second thought had come to Governor Shannon.
[Ground-Truth Recording]: Meanwhile a sober second thought had come to Governor Shannon.

System Comparison

Similarity Test

[Text to be generated]: He spoke French perfectly, I have been told, when need was; but delighted usually in talking the broadest Yorkshire.

Reference Voice | RetrieverTTS | Meta-StyleSpeech

[Text to be generated]: Gradually I knew I was mastering him--then all was blank.

Reference Voice | RetrieverTTS | Meta-StyleSpeech

Naturalness Test

[Text after insertion]: Recovering his recollection on the instant, instead of sounding an alarm, which might prove fatal to himself, he remained stationary, an attentive observer of the other's motions.

Ground Truth Recording | RetrieverTTS | Meta-StyleSpeech | C. Tang et al.

[Text after insertion]: Miss Taylor was soon starving for human companionship, for the lighter touches of life and some of its warmth and laughter.

Ground Truth Recording | RetrieverTTS | Meta-StyleSpeech | C. Tang et al.

Ablation Study

[Text after insertion]: He was going to leave the room--she followed him, and cried, "But, my Lord, how shall I see again the unhappy object of my treachery?"

Ground Truth Recording | RetrieverTTS | -adv | -prosody-smooth | -retriever

Related Links

The key idea of global factor extraction in this work is inspired by "Retriever: Learning Content-Style Representation as a Token-Level Bipartite Graph".

We refer to the publicly available code to build the TTS pipeline.

BibTeX

@inproceedings{yin2022retrievertts,
  title={RetrieverTTS: Modeling Decomposed Factors for Text-Based Speech Insertion},
  author={Yin, Dacheng and Tang, Chuanxin and Liu, Yanqing and Wang, Xiaoqiang and Zhao, Zhiyuan and Zhao, Yucheng and Xiong, Zhiwei and Zhao, Sheng and Luo, Chong},
  booktitle={Interspeech},
  year={2022}
}