Low-latency Real-time Voice Conversion on CPU

Konstantine Sadov, Matthew Hutter, Asara Near
Koe AI

0. Contents

1. Abstract
2. Demos -- Voice Conversion

1. Abstract

We adapt the architecture of Waveformer, a real-time streaming sound extraction network, to low-latency, real-time voice conversion. We use knowledge distillation: a larger neural network, RVC, is first trained to perform high-quality voice conversion, and pairs of input voices and their RVC-converted waveforms then serve as a synthetic parallel dataset for training a smaller network. This smaller network is combined with a multi-period discriminator derived from VITS to form a generative adversarial network. The resulting network converts voices in an any-to-one manner while streaming, with a latency of under 20 ms at 16 kHz, and runs nearly 2.8x faster than real time on a consumer CPU.
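To put the headline figures in concrete terms, the short Python sketch below works through the arithmetic they imply. It is only an illustration: the three constants are taken from the abstract, while the helper names (`samples_in_window`, `compute_time_ms`) are hypothetical and say nothing about the model's actual chunk size or internal buffering.

```python
def samples_in_window(duration_ms: float, sample_rate_hz: int) -> int:
    """Number of audio samples spanned by a window of the given duration."""
    return round(duration_ms * sample_rate_hz / 1000)


def compute_time_ms(audio_ms: float, speedup: float) -> float:
    """Time needed to process `audio_ms` of audio at `speedup` x real time."""
    return audio_ms / speedup


SAMPLE_RATE_HZ = 16_000   # sampling rate quoted in the abstract
LATENCY_BOUND_MS = 20.0   # upper bound on latency quoted in the abstract
SPEEDUP = 2.8             # approximate speed relative to real time

# How many samples arrive within the stated latency bound.
budget_samples = samples_in_window(LATENCY_BOUND_MS, SAMPLE_RATE_HZ)

# How long one second of audio takes to convert at the stated speed.
one_second_cost_ms = compute_time_ms(1000.0, SPEEDUP)

print(f"{budget_samples} samples fit in a {LATENCY_BOUND_MS:.0f} ms window")
print(f"1 s of audio takes about {one_second_cost_ms:.0f} ms to process")
```

Under these figures, a 20 ms window at 16 kHz spans 320 samples, and one second of audio is converted in roughly 357 ms, which leaves headroom for other work on the same CPU.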

2. Demos -- Voice Conversion

Target Speaker	Source speech	QuickVC	RVC	LLVC	LLVC-HFG	LLVC-NC
p_8312	61-70968-0000
	1089-134686-0000
	1221-135766-000
	6829-68769-0000

Our proposed models (LLVC, LLVC-HFG, and LLVC-NC) are shown in bold.