Abstract

In recent years, emotional text-to-speech has shown considerable progress. However, it requires a large amount of labeled data, which is not easily accessible. Even when an emotional speech dataset can be acquired, controlling emotion intensity remains limited. In this work, we propose a novel method for cross-speaker emotion transfer and manipulation using vector arithmetic in latent style space. By leveraging only a few labeled samples, we generate emotional speech from reading-style speech without losing the speaker identity. Furthermore, emotion strength is readily controllable using a scalar value, providing an intuitive way for users to manipulate speech. Experimental results show that the proposed method achieves superior expressiveness, naturalness, and controllability while preserving speaker identity.

Emotion Transfer

Note that all target speakers below are emotion-neutral, reading-style speakers. Along with reading-style speakers, we trained our model on a variety of speaking styles, such as animation dubbing, whispering, and screaming, which are atypical for TTS datasets. Averaging style vectors (Style Mean) performed poorly when trained on our data, often failing to maintain speaker identity, although it preserved speaker identity well when trained on emotion-neutral datasets only. Also note that the generated sentences below are not in the dataset, so the content of each ground-truth audio file does not match that of the generated file. Unless stated otherwise, all files from the proposed model are generated with alpha (emotion intensity) set to 1.5.

Target Emotion Ground Truth Proposed Style Mean
Neutral
Angry  
Sad  
Happy  
Target Emotion Ground Truth Proposed Style Mean
Neutral
Angry  
Sad  
Happy  

One-shot Emotion Transfer

Emotion Proposed
Neutral
Angry
Sad
Happy

Controlling Emotion Strength

Emotion alpha = 0.0 alpha = 0.5 alpha = 1.0 alpha = 1.5 alpha = 2.0
Angry
Sad
Happy
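
The role of alpha can be illustrated with a minimal sketch of vector arithmetic in a style space. This is not the implementation behind the samples on this page: the function names, the toy 3-dimensional vectors, and in particular the way the emotion direction is computed (difference between the mean emotional and mean neutral style vectors) are assumptions made for illustration only.

```python
from typing import List, Sequence

Vector = List[float]


def mean_vector(vectors: Sequence[Sequence[float]]) -> Vector:
    """Element-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def emotion_direction(emotional: Sequence[Sequence[float]],
                      neutral: Sequence[Sequence[float]]) -> Vector:
    """ASSUMPTION: the emotion direction is taken as the difference between
    the mean emotional style vector and the mean neutral style vector."""
    mu_emotional = mean_vector(emotional)
    mu_neutral = mean_vector(neutral)
    return [e - n for e, n in zip(mu_emotional, mu_neutral)]


def shift_style(style: Sequence[float],
                direction: Sequence[float],
                alpha: float) -> Vector:
    """Move a style vector along an emotion direction, scaled by alpha."""
    return [s + alpha * d for s, d in zip(style, direction)]


# Toy style vectors (made-up numbers, not taken from any real model).
neutral_styles = [[0.0, 1.0, 0.0], [0.0, 3.0, 0.0]]
angry_styles = [[2.0, 2.0, 0.0], [2.0, 2.0, 2.0]]
angry_direction = emotion_direction(angry_styles, neutral_styles)

# Hypothetical style vector of a neutral, reading-style target speaker.
target_style = [1.0, 0.0, 0.0]

for alpha in (0.0, 0.5, 1.0, 1.5, 2.0):
    print(alpha, shift_style(target_style, angry_direction, alpha))
```

In this sketch, alpha = 0.0 leaves the target speaker's style vector unchanged, larger values move it further along the emotion direction, and a negative alpha moves it in the opposite direction.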

Controlling Emotion (negative direction)

Alpha (emotion intensity) is not applied to the neutral sample.

Emotion positive (alpha=1.0) negative (alpha=-1.0)
Neutral  
Angry
Sad
Happy

Ablation Study

Emotion Proposed w/o Adv. spkr cls w/o Cycle-consistency loss
Neutral
Angry
Sad
Happy

Long sentence

An example of long-sentence generation. Each emotional sample was generated with the scaling factor alpha set to 1.0.

Emotion Audio
Neutral
Angry
Sad
Happy