CROSS-SPEAKER EMOTION TRANSFER BY MANIPULATING SPEECH STYLE LATENTS
Abstract
In recent years, emotional text-to-speech has shown considerable progress. However, it requires a large amount of labeled data, which is not easily accessible. Even if it is possible to acquire an emotional speech dataset, there is still a limitation in controlling emotion intensity. In this work, we propose a novel method for cross-speaker emotion transfer and manipulation using vector arithmetic in latent style space. By leveraging only a few labeled samples, we generate emotional speech from reading-style speech without losing the speaker identity. Furthermore, emotion strength is readily controllable using a scalar value, providing an intuitive way for users to manipulate speech. Experimental results show the proposed method affords superior performance in terms of expressiveness, naturalness, and controllability, preserving speaker identity.
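The vector arithmetic described in the abstract can be illustrated with a minimal NumPy sketch. Everything here is an assumption made for illustration: the function names `emotion_direction` and `shift_style`, the latent size `dim`, and the random toy data are hypothetical, and taking the difference of class means is only one simple way to obtain an emotion direction from a few labeled samples. This page does not state how the direction is actually computed, so the sketch should not be read as the authors' procedure.

```python
import numpy as np

def emotion_direction(neutral_styles, emotion_styles):
    """Unit vector from the neutral cluster mean to the emotion cluster mean.

    Assumption: a mean difference stands in for whatever direction-finding
    step the actual method uses.
    """
    diff = emotion_styles.mean(axis=0) - neutral_styles.mean(axis=0)
    return diff / np.linalg.norm(diff)

def shift_style(style, direction, alpha):
    """Move a style vector along the emotion direction; alpha sets intensity."""
    return style + alpha * direction

rng = np.random.default_rng(0)
dim = 8  # hypothetical latent size

# Toy data: a few labeled style vectors per class (not real model outputs).
offset = np.zeros(dim)
offset[0] = 1.0
neutral = rng.normal(0.0, 0.1, size=(5, dim))
angry = rng.normal(0.0, 0.1, size=(5, dim)) + offset

direction = emotion_direction(neutral, angry)

# Style vector of a hypothetical reading-style target speaker.
target = rng.normal(0.0, 0.1, size=dim)
shifted = shift_style(target, direction, alpha=1.5)
```

Because the edit is additive, the target speaker's own style vector is kept and only the component along `direction` changes, which is the intuition behind controlling intensity with a single scalar.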
Emotion Transfer
Note that all target speakers below are emotion-neutral, reading-style speakers.
Along with reading-style speakers, we trained our model on a variety of speaking styles, such as animation dubbing, whispering, and screaming, which are atypical for TTS datasets.
The baseline that averages style vectors (Style Mean) performed poorly when trained on our data, often failing to maintain speaker identity, although it preserved speaker identity well when trained on emotion-neutral datasets only.
Also note that the generated sentences below are not in the dataset, so the content of each ground-truth audio file does not match that of the corresponding generated file. Unless stated otherwise, all files from the proposed model are generated with alpha (emotion intensity) = 1.5.
Target Emotion | Ground Truth | Proposed | Style Mean
Neutral
Angry
Sad
Happy
Target Emotion | Ground Truth | Proposed | Style Mean
Neutral
Angry
Sad
Happy
One-shot Emotion Transfer
Emotion | Proposed
Neutral
Angry
Sad
Happy
Controlling Emotion Strength
Emotion | alpha = 0.0 | alpha = 0.5 | alpha = 1.0 | alpha = 1.5 | alpha = 2.0
Angry
Sad
Happy
Controlling Emotion (negative direction)
Alpha (emotion intensity) is not applied to the neutral sample.
Emotion | positive (alpha = 1.0) | negative (alpha = -1.0)
Neutral
Angry
Sad
Happy
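The alpha values used in the strength and negative-direction samples above can be read as signed steps along a single emotion direction. The short sketch below sweeps those values over a toy vector. The name `apply_emotion`, the 4-dimensional `direction`, and the `style` vector are hypothetical placeholders, not quantities taken from the model.

```python
import numpy as np

def apply_emotion(style, direction, alpha):
    """Shift a style vector by alpha along a unit emotion direction."""
    return style + alpha * direction

# Hypothetical unit emotion direction and neutral style vector.
direction = np.array([1.0, 0.0, 0.0, 0.0])
style = np.array([0.2, -0.3, 0.5, 0.1])

# Alpha values used on this page, including the negative direction.
alphas = [-1.0, 0.0, 0.5, 1.0, 1.5, 2.0]

# Signed displacement along the emotion direction for each alpha.
projections = [
    float(np.dot(apply_emotion(style, direction, a) - style, direction))
    for a in alphas
]

for a, p in zip(alphas, projections):
    print(f"alpha={a:+.1f}  displacement along direction={p:+.2f}")
```

In this toy setting the displacement grows linearly with alpha, alpha = 0.0 leaves the style vector untouched, and a negative alpha moves the vector the opposite way along the same axis.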
Ablation Study
Emotion | Proposed | w/o Adv. spkr cls | w/o Cycle-consistency loss
Neutral
Angry
Sad
Happy
Long sentence
An example of long-sentence generation. Each emotional speech sample was generated with alpha = 1.0.