This paper proposes a new "decompose-and-edit" paradigm for the text-based speech insertion task that supports arbitrary-length speech insertion and even full-sentence generation. In the proposed paradigm, global and local factors in speech are explicitly decomposed and separately manipulated to achieve high speaker similarity and continuous prosody. Specifically, we propose to represent the global factors by multiple tokens, which are extracted by a cross-attention operation and then injected back by a link-attention operation. Thanks to this rich representation of the global factors, we achieve high speaker similarity in a zero-shot manner. In addition, we introduce a prosody smoothing task to make the local prosody factor context-aware and thereby achieve satisfactory prosody continuity. We further improve voice quality with an adversarial training stage. In subjective tests, our method achieves state-of-the-art performance in both naturalness and similarity.
The proposed framework, RetrieverTTS, is composed of four parts: a phoneme encoder, a variance adaptor, a mel decoder, and a global factor encoder. The extracted global factors (style, timbre) are represented by m tokens. For style injection, the first token is added to the input of the phoneme encoder. For timbre injection, all m tokens are injected into the mel decoder via the link-attention mechanism, as sketched below.
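The following is a minimal PyTorch sketch of this global factor path. The module and function names (`GlobalFactorEncoder`, `inject_global_factors`), the shapes, and the default hyperparameters are illustrative assumptions rather than the authors' exact implementation; link attention is shown here as standard multi-head attention from decoder states onto the m global tokens.

```python
import torch
import torch.nn as nn

class GlobalFactorEncoder(nn.Module):
    """Extracts m global tokens from a reference utterance via cross-attention.

    The m learnable query tokens attend over the encoded reference frames;
    the resulting tokens summarize global factors (style, timbre).
    """
    def __init__(self, d_model: int = 256, m_tokens: int = 8, n_heads: int = 4):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(m_tokens, d_model))
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, ref_frames: torch.Tensor) -> torch.Tensor:
        # ref_frames: (batch, T_ref, d_model) encoded reference mel frames.
        q = self.queries.unsqueeze(0).expand(ref_frames.size(0), -1, -1)
        tokens, _ = self.cross_attn(q, ref_frames, ref_frames)
        return tokens  # (batch, m, d_model) global factor tokens


def inject_global_factors(phoneme_emb, decoder_hidden, tokens, link_attn):
    """Sketch of the two injection points described above.

    Style: the first token is added to the phoneme encoder input.
    Timbre: decoder states attend over all m tokens ("link attention").
    """
    phoneme_emb = phoneme_emb + tokens[:, :1]  # broadcast over phoneme positions
    injected, _ = link_attn(decoder_hidden, tokens, tokens)
    return phoneme_emb, decoder_hidden + injected
```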
We add a prosody smoothing task during training. For 50% of the training samples, we add partially masked ground-truth prosody embeddings to the encoded phonemes before feeding them into the variance adaptor; the masked positions are filled with zeros (see the sketch below). An additional adversarial training stage is introduced to improve speech quality.
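Below is a minimal PyTorch sketch of how such a partially masked prosody input could be constructed. The function name and the per-position keep probability are assumptions for illustration; the description above only states that 50% of training samples receive partially masked ground-truth prosody embeddings, with masked positions filled with zeros.

```python
import torch

def add_partially_masked_prosody(phoneme_hidden: torch.Tensor,
                                 gt_prosody: torch.Tensor,
                                 apply_prob: float = 0.5,
                                 keep_prob: float = 0.5) -> torch.Tensor:
    """Add ground-truth prosody embeddings to encoded phonemes, with a random
    subset of positions zeroed out, for a random 50% of samples in the batch.

    phoneme_hidden, gt_prosody: (batch, length, d_model), aligned per position.
    keep_prob is an assumed masking ratio, not specified in the paper.
    """
    batch, length, _ = phoneme_hidden.shape
    device = phoneme_hidden.device
    # Select which samples in the batch receive the smoothing task (50%).
    sample_mask = torch.rand(batch, 1, 1, device=device) < apply_prob
    # Within those samples, keep each position with probability keep_prob;
    # masked positions contribute zero, as stated above.
    pos_mask = torch.rand(batch, length, 1, device=device) < keep_prob
    return phoneme_hidden + gt_prosody * pos_mask * sample_mask
```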
The key idea of global factor extraction in this work is inspired by "Retriever: Learning Content-Style Representation as a Token-Level Bipartite Graph".
We refer to publicly available code when building the TTS pipeline.
@inproceedings{yin2022retrievertts,
title={RetrieverTTS: Modeling Decomposed Factors for Text-Based Speech Insertion},
author={Yin, Dacheng and Tang, Chuanxin and Liu, Yanqing and Wang, Xiaoqiang and Zhao, Zhiyuan and Zhao, Yucheng and Xiong, Zhiwei and Zhao, Sheng and Luo, Chong},
booktitle={Interspeech},
year={2022}
}