CROSS-SPEAKER EMOTION TRANSFER BY MANIPULATING SPEECH STYLE LATENTS

Abstract

In recent years, emotional text-to-speech has shown considerable progress. However, it requires a large amount of labeled data, which is not easily accessible. Even when an emotional speech dataset can be acquired, control over emotion intensity remains limited. In this work, we propose a novel method for cross-speaker emotion transfer and manipulation using vector arithmetic in latent style space. By leveraging only a few labeled samples, we generate emotional speech from reading-style speech without losing the speaker identity. Furthermore, emotion strength is readily controllable using a scalar value, providing an intuitive way for users to manipulate speech. Experimental results show that the proposed method achieves superior performance in terms of expressiveness, naturalness, and controllability while preserving speaker identity.

Emotion Transfer

Note that all target speakers below are emotion-neutral, reading-style speakers. In addition to reading-style speakers, we trained our model on a variety of speaking styles, such as animation dubbing, whispering, and screaming, which are atypical for TTS datasets. The baseline that averages style vectors (Style Mean) performed poorly when trained on our data, often failing to maintain speaker identity, although it preserved speaker identity well when trained on emotion-neutral datasets only. Also note that the generated sentences below are not in the dataset, so the content of a ground-truth audio file does not match that of the corresponding generated file. Unless otherwise noted, all files from the proposed model are generated with alpha (emotion intensity) set to 1.5.
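
To make the role of alpha concrete, the following is a minimal sketch of style-vector arithmetic with a scalar intensity. It is an illustration under stated assumptions, not the implementation behind these samples: this page does not say how the emotion direction is obtained, so the sketch assumes a simple difference between the mean emotional and mean neutral style vectors. The function names (`mean_vector`, `emotion_direction`, `transfer_emotion`) and the toy 3-dimensional vectors are hypothetical.

```python
from typing import List, Sequence

Vector = List[float]


def mean_vector(vectors: Sequence[Sequence[float]]) -> Vector:
    """Element-wise mean of equal-length style vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def emotion_direction(
    emotional: Sequence[Sequence[float]],
    neutral: Sequence[Sequence[float]],
) -> Vector:
    """ASSUMPTION: the direction is the mean emotional style minus the
    mean neutral style. The source does not specify this step."""
    mu_emotional = mean_vector(emotional)
    mu_neutral = mean_vector(neutral)
    return [e - n for e, n in zip(mu_emotional, mu_neutral)]


def transfer_emotion(
    target_style: Sequence[float],
    direction: Sequence[float],
    alpha: float = 1.5,
) -> Vector:
    """Shift a target speaker's style vector along the emotion direction.
    alpha scales the shift; alpha = 0 leaves the style unchanged."""
    return [s + alpha * d for s, d in zip(target_style, direction)]


# Toy style vectors (hypothetical values, for illustration only).
emotional_samples = [[1.0, 2.0, 0.0], [3.0, 2.0, 0.0]]
neutral_samples = [[0.0, 0.0, 0.0], [0.0, 2.0, 0.0]]
direction = emotion_direction(emotional_samples, neutral_samples)

target = [0.5, 0.5, 0.5]  # style vector of a neutral, reading-style speaker
shifted = transfer_emotion(target, direction, alpha=1.5)
```

In this sketch, a larger alpha moves the target style further along the emotion direction, and a negative alpha would move it the opposite way. Whether a real style space behaves this linearly depends on the model and is not something this example demonstrates.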