Original title: HierSpeech++: Bridging the Gap between Semantic and Acoustic Representation of Speech by Hierarchical Variational Inference for Zero-shot Speech Synthesis
Authors: Sang-Hoon Lee, Ha-Yeong Choi, Seung-Bin Kim, Seong-Whan Lee
This article presents HierSpeech++, a fast and robust zero-shot speech synthesizer for both text-to-speech (TTS) and voice conversion (VC). Existing large language model (LLM)-based and diffusion-based models suffer from slow inference and a lack of robustness. HierSpeech++ addresses these issues with a hierarchical variational inference framework, which significantly improves the robustness, expressiveness, and naturalness of synthetic speech. For TTS, a text-to-vec module generates a self-supervised speech representation and an F0 (fundamental frequency) representation from text and prosody prompts; a hierarchical speech synthesizer then produces the waveform from these representations and a voice prompt, improving naturalness and speaker similarity. The system also introduces an efficient speech super-resolution framework that upsamples output from 16 kHz to 48 kHz. In experiments, HierSpeech++ outperforms previous models, including LLM-based and diffusion-based systems, and achieves human-level quality in zero-shot speech synthesis. Audio samples and source code are publicly available, marking a promising step forward for zero-shot speech synthesis.
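To make the described pipeline concrete, below is a minimal PyTorch sketch of the two-stage flow: text to semantic representation and F0, then to a 16 kHz waveform, then super-resolved to 48 kHz. All module names (TextToVec, HierSynthesizer, SpeechSR), layer choices, and tensor shapes are illustrative assumptions for exposition, not the paper's actual architecture.

```python
# Hedged sketch of a hierarchical zero-shot TTS pipeline (shapes are assumptions).
import torch
import torch.nn as nn

class TextToVec(nn.Module):
    """Stage 1 (assumed): map text tokens + prosody prompt to a
    self-supervised speech representation and an F0 contour."""
    def __init__(self, vocab=256, dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.to_vec = nn.Linear(dim, dim)   # semantic-representation head
        self.to_f0 = nn.Linear(dim, 1)      # fundamental-frequency head

    def forward(self, text_tokens, prosody_prompt):
        h = self.embed(text_tokens) + prosody_prompt   # condition on prosody
        return self.to_vec(h), self.to_f0(h).squeeze(-1)

class HierSynthesizer(nn.Module):
    """Stage 2 (assumed): synthesize a 16 kHz waveform from the semantic
    representation, F0, and a target-speaker voice prompt."""
    def __init__(self, dim=256, hop=320):   # 320 samples per frame at 16 kHz (assumed)
        super().__init__()
        self.proj = nn.Linear(dim + 1 + dim, hop)

    def forward(self, vec, f0, voice_prompt):
        cond = torch.cat([vec, f0.unsqueeze(-1),
                          voice_prompt.expand_as(vec)], dim=-1)
        return self.proj(cond).flatten(1)   # (batch, frames * hop) samples

class SpeechSR(nn.Module):
    """Post-hoc super-resolution (assumed): upsample 16 kHz -> 48 kHz."""
    def forward(self, wav16k):
        return torch.nn.functional.interpolate(
            wav16k.unsqueeze(1), scale_factor=3, mode="linear",
            align_corners=False).squeeze(1)

# Wire the stages together on dummy inputs.
tokens = torch.randint(0, 256, (1, 50))    # 50 text tokens
prosody = torch.zeros(1, 50, 256)          # prosody-prompt features
voice = torch.zeros(1, 1, 256)             # target-speaker voice prompt
ttv, synth, sr = TextToVec(), HierSynthesizer(), SpeechSR()

vec, f0 = ttv(tokens, prosody)             # text -> semantic rep + F0
wav16k = synth(vec, f0, voice)             # -> 16 kHz waveform
wav48k = sr(wav16k)                        # -> 48 kHz waveform
print(wav16k.shape, wav48k.shape)          # torch.Size([1, 16000]) torch.Size([1, 48000])
```

The key design point the sketch illustrates is the separation of concerns: the text-to-vec stage handles linguistic and prosodic content, the hierarchical synthesizer handles acoustic realization and speaker identity, and super-resolution is a cheap, decoupled final step.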
Original article: https://arxiv.org/abs/2311.12454