Retriever: Learning Content-Style Representation as a Token-Level Bipartite Graph

¹University of Science and Technology of China, ²HKUST, ³Microsoft Research Asia, ⁴EIT
ICLR 2022

Retriever is a modal-agnostic content-style disentanglement framework.


This paper addresses the unsupervised learning of content-style decomposed representation. We first give a definition of style and then model the content-style representation as a token-level bipartite graph. An unsupervised framework, named Retriever, is proposed to learn such representations. First, a cross-attention module is employed to retrieve permutation invariant (P.I.) information, defined as style, from the input data. Second, a vector quantization (VQ) module is used, together with man-induced constraints, to produce interpretable content tokens. Last, an innovative link attention module serves as the decoder to reconstruct data from the decomposed content and style, with the help of the linking keys. Being modal-agnostic, the proposed Retriever is evaluated in both speech and image domains. The state-of-the-art zero-shot voice conversion performance confirms the disentangling ability of our framework. Top performance is also achieved in the part discovery task for images, verifying the interpretability of our representation. In addition, the vivid part-based style transfer quality demonstrates the potential of Retriever to support various fascinating generative tasks.
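As a rough illustration of the VQ step in the content branch, the sketch below snaps each continuous feature vector to its nearest entry in a small codebook, so that only a discrete index survives per token. The codebook, the feature values, and the `quantize` helper are hypothetical and chosen for readability; the man-induced constraints and the training procedure are not modeled here.

```python
import math

def quantize(features, codebook):
    """Return, for each feature vector, the index of its nearest codebook entry."""
    indices = []
    for f in features:
        dists = [math.dist(f, c) for c in codebook]  # Euclidean distances
        indices.append(dists.index(min(dists)))
    return indices

# Hypothetical 2-D codebook with three entries (assumption, not from the paper).
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]]
# Hypothetical encoder outputs, one vector per input token.
features = [[0.1, -0.2], [0.9, 1.2], [-0.8, 0.7], [0.2, 0.1]]

indices = quantize(features, codebook)
content_tokens = [codebook[i] for i in indices]  # quantized content tokens
print(indices)  # [0, 1, 2, 0]
```

Because each token is reduced to one of a few codebook entries, fine-grained detail such as style has little room to pass through this branch, which is the intuition behind using VQ as an information bottleneck.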


The framework is called "Retriever" because of its dual-retrieval operations: the cross-attention module retrieves style for content-style separation, and the link attention module retrieves content-specific style for data reconstruction.
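The second retrieval can be pictured as ordinary scaled dot-product attention in which content tokens act as queries, linking keys act as keys, and style tokens act as values. This role assignment is an assumption made for illustration, and the toy vectors and the `attend` helper below are hypothetical; this is a minimal sketch, not the authors' implementation.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(queries, keys, values):
    """Scaled dot-product attention: each query retrieves a weighted mix of values."""
    dim = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out

# Hypothetical toy data: three content tokens, two linking keys, two style tokens.
content_tokens = [[4.0, 0.0], [0.0, 4.0], [4.0, 0.0]]
linking_keys = [[1.0, 0.0], [0.0, 1.0]]                 # one key per style token
style_tokens = [[10.0, 0.0, 0.0], [0.0, 10.0, 0.0]]

# Each content token retrieves its own content-specific style vector.
retrieved = attend(content_tokens, linking_keys, style_tokens)
for token, style in zip(content_tokens, retrieved):
    print(token, [round(s, 2) for s in style])
```

In this toy setup, identical content tokens retrieve identical style vectors, and each token draws mostly from the style token whose linking key it matches best.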

In this paper, style is defined as permutation invariant (P.I.) information. The first retrieval, which serves as the style encoder, is permutation invariant to the input tokens and therefore blocks any non-P.I. information from passing through it. To prevent style information from leaking into the content representation, we use a VQ information bottleneck in the content branch.
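The permutation invariance of the first retrieval can be checked numerically. The sketch below assumes a fixed set of query vectors that attend over the input tokens, with the tokens used directly as keys and values (no learned projections); the names `retrieve_style`, `queries`, and `tokens` are hypothetical. Shuffling the input tokens leaves the retrieved style tokens unchanged up to floating-point error, because the attention output is a weighted sum over the token set.

```python
import math
import random

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def retrieve_style(queries, tokens):
    """Cross-attention: each fixed query pools the input tokens into one style token."""
    dim = len(tokens[0])
    style = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, t)) / math.sqrt(dim) for t in tokens]
        weights = softmax(scores)
        style.append([sum(w * t[j] for w, t in zip(weights, tokens))
                      for j in range(dim)])
    return style

random.seed(0)
dim = 4
tokens = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(10)]   # input tokens
queries = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(3)]   # fixed queries

style = retrieve_style(queries, tokens)
shuffled = random.sample(tokens, len(tokens))        # same tokens, different order
style_shuffled = retrieve_style(queries, shuffled)

# Compare the two results element by element.
same = all(math.isclose(a, b, abs_tol=1e-9)
           for row_a, row_b in zip(style, style_shuffled)
           for a, b in zip(row_a, row_b))
print(same)
```

Information that depends on token order, such as what is said and when in an utterance, cannot be recovered from such an output, which is why this path can carry only P.I. information.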

Image Domain

Co-Part Segmentation
Shape-Appearance Transfer
Part-Level Style Transfer

Speech Domain

Zero-Shot Voice Conversion

Librispeech Samples
Source | Target | Converted

System Comparison
Source | Authentic Target | AutoVC | AdaIN-VC | FragmentVC | S2VC | Retriever (ours)

Ablation Study
Source | Authentic Target | Too narrow bottleneck | Too wide bottleneck | AdaIN decoder | Retriever

Conversion results with different numbers of style tokens
Source | Authentic Target | 1 style token | 5 style tokens | 10 style tokens | 60 style tokens

Conversion results with different numbers of target utterances during inference
Source | Authentic Target | 1 target utterance | 3 target utterances | 5 target utterances | 10 target utterances

Related Links


  title   = {Retriever: Learning Content-Style Representation as a Token-Level Bipartite Graph},
  author  = {Yin, Dacheng and Ren, Xuanchi and Luo, Chong and Wang, Yuwang and Xiong, Zhiwei and Zeng, Wenjun},
  booktitle = {ICLR},
  year    = {2022}