Low-latency Real-time Voice Conversion on CPU

Konstantine Sadov, Matthew Hutter, Asara Near
Koe AI

0. Contents

  1. Abstract
  2. Demos -- Voice Conversion


1. Abstract

We adapt the architecture of Waveformer, a real-time streaming sound extraction network, to low-latency, real-time voice conversion. We do this through knowledge distillation: a larger neural network, RVC, is first trained to perform high-quality voice conversion, and pairs of input voices and their RVC-converted waveforms then form a synthetic parallel dataset on which a smaller network is trained. The smaller network is combined with a multi-period discriminator derived from VITS to form a generative adversarial network. The resulting network performs any-to-one voice conversion while streaming, with a latency of under 20 ms at 16 kHz, and runs nearly 2.8x faster than real time on a consumer CPU.
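To make the reported figures concrete, the short sketch below converts them into sample counts and compute times. It is illustrative arithmetic only, not part of the model: it assumes that "16 kHz" refers to the audio sample rate and that "2.8x faster than real time" means processing time equals audio duration divided by 2.8. The function and constant names are hypothetical.

```python
# Illustrative arithmetic on the abstract's figures (names are hypothetical).
SAMPLE_RATE_HZ = 16_000   # assumption: "16 kHz" is the audio sample rate
LATENCY_BUDGET_MS = 20    # upper bound on latency stated in the abstract
SPEEDUP = 2.8             # "nearly 2.8x faster than real time"


def samples_in(ms: float, sample_rate_hz: int = SAMPLE_RATE_HZ) -> int:
    """Number of audio samples spanned by `ms` milliseconds."""
    return round(ms * sample_rate_hz / 1000)


def compute_time_ms(audio_ms: float, speedup: float = SPEEDUP) -> float:
    """Processing time for `audio_ms` of audio, assuming a fixed speed-up."""
    return audio_ms / speedup


# Samples that fit inside the 20 ms latency bound at 16 kHz.
budget_samples = samples_in(LATENCY_BUDGET_MS)

# Approximate time needed to convert one second of audio.
one_second_cost_ms = compute_time_ms(1000.0)

print(budget_samples, round(one_second_cost_ms, 1))
```

Under these assumptions, a latency below 20 ms corresponds to fewer than 320 samples of total delay, and one second of speech is converted in roughly 357 ms, which leaves headroom for other work on the same CPU.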

2. Demos -- Voice Conversion

Target speaker: p_8312
Methods compared: QuickVC, RVC, LLVC, LLVC-HFG, LLVC-NC
Source speech:
  61-70968-0000
  1089-134686-0000
  1221-135766-000
  6829-68769-0000
Our proposed models are in bold