Low-latency Real-time Voice Conversion on CPU
1. Abstract
We adapt the architecture of Waveformer, a real-time streaming sound extraction network, to low-latency real-time voice conversion. We do this with knowledge distillation: a larger neural network, RVC, is first trained to perform high-quality voice conversion, and pairs of input speech and the corresponding converted waveforms then form a synthetic parallel dataset on which a smaller network is trained. This smaller network is combined with a multi-period discriminator derived from VITS to form a generative adversarial network. The resulting network converts voices in an any-to-one manner while streaming, with a latency of under 20 ms at 16 kHz, and runs nearly 2.8x faster than real time on a consumer CPU.
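The distillation step described above can be illustrated with a minimal sketch. It is not the authors' training code: `build_parallel_dataset`, `toy_teacher`, and `l1_distance` are hypothetical names, the toy teacher merely halves the amplitude in place of a real converter such as RVC, and the sketch assumes the teacher output is time-aligned with its input and has the same length.

```python
from typing import Callable, List, Sequence, Tuple

Waveform = List[float]  # mono audio samples; stand-in for a real tensor type


def build_parallel_dataset(
    sources: Sequence[Waveform],
    teacher: Callable[[Waveform], Waveform],
) -> List[Tuple[Waveform, Waveform]]:
    """Pair each source utterance with the teacher's conversion of it."""
    pairs = []
    for src in sources:
        converted = teacher(src)
        # Assumption of this sketch: source and converted audio are
        # time-aligned and equally long, so they can be compared per sample.
        if len(converted) != len(src):
            raise ValueError("teacher output must match the input length")
        pairs.append((src, converted))
    return pairs


def toy_teacher(wave: Waveform) -> Waveform:
    """Hypothetical stand-in for the large converter: halves the amplitude."""
    return [0.5 * s for s in wave]


def l1_distance(a: Waveform, b: Waveform) -> float:
    """Mean absolute difference, one possible student-vs-target measure."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


sources = [[0.0, 0.2, -0.4], [1.0, -1.0]]
dataset = build_parallel_dataset(sources, toy_teacher)

# An untrained "identity" student would return the source unchanged, so its
# distance to the teacher target is the gap that training needs to close.
identity_gap = l1_distance(dataset[1][0], dataset[1][1])
```

Each `(source, converted)` pair plays the role that a recording of two speakers saying the same sentence would play in a natural parallel dataset, which is why the result is called a synthetic parallel dataset. In a real setup the adversarial loss from the discriminator would be added on top of any such reconstruction measure.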
2. Demos -- Voice Conversion
Target speaker | Source speech | Method | | | |
 | | QuickVC | RVC | LLVC | LLVC-HFG | LLVC-NC
p_8312 | 61-70968-0000 | | | | |
p_8312 | 1089-134686-0000 | | | | |
p_8312 | 1221-135766-000 | | | | |
p_8312 | 6829-68769-0000 | | | | |