Pirating a TTS voice // melanpan.nl

One of the favourite text-to-speech voices at NURDspace is Femke, a Dutch-Flemish sounding voice by Acapela Group. She gets used for announcements, the !say IRC command, yelling at people to put the bins out, that kind of thing. For a long time we were pulling audio from an Android API, which worked well enough that I even started training my own model based on it.

Then the API got shut down. This is the story of how I pirated Femke. Twice.

What is Piper?

Piper is a fast local neural network TTS system from the rhasspy project. It can synthesise a short sentence in milliseconds on a CPU, which makes it perfect for home automation and IRC bots. It also has a reasonable training pipeline and a bunch of pretrained checkpoints to fine-tune from, which is exactly what I needed.

Note: this was all done before Piper development moved to https://github.com/OHF-Voice/piper1-gpl

Version 1: The Android API

Acapela Group used to expose their voices through an Android API endpoint. WillFromAfarDownloader made it straightforward to pull audio from it: feed it text, get Femke WAVs back.

The dataset

For the text I used the Dutch CC-100 corpus, which unpacks to a 30GB text file of Dutch sentences. I wrote a filtering script to pull out lines under 160 characters containing words like “computer” and “hack”, plus a handful of Dutch swear words, because of course people are going to use Femke to swear. This brought it down to a manageable 62MB.
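A minimal sketch of what such a filter could look like. This is not the original script: the keyword list only contains the two words named above, and matching on whole words (rather than substrings) is an assumption.

```python
import re
from typing import Iterable, Iterator, Sequence

# Only "computer" and "hack" come from the post; extend the list as needed.
KEYWORDS = ("computer", "hack")
MAX_CHARS = 160  # keep lines strictly shorter than this

_WORD_RE = re.compile(r"\w+")


def filter_lines(
    lines: Iterable[str],
    keywords: Sequence[str] = KEYWORDS,
    max_chars: int = MAX_CHARS,
) -> Iterator[str]:
    """Yield stripped lines under max_chars that contain a keyword."""
    wanted = {k.lower() for k in keywords}
    for line in lines:
        line = line.strip()
        if not line or len(line) >= max_chars:
            continue
        # Whole-word, case-insensitive match (an assumption, see above).
        words = {w.lower() for w in _WORD_RE.findall(line)}
        if words & wanted:
            yield line
```

Because it works on any iterable of strings, an open file handle can be passed in directly, so the 30GB corpus is streamed line by line instead of being loaded into memory.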

I then wrote everything out in LJSpeech format (a pipe-delimited CSV with filename and transcription), splitting on full stops to keep sentences short, and filtering out anything under 10 characters. I stopped at 1500 samples.
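Roughly, that step could be sketched as follows. The function names and the zero-padded numeric IDs are made up for illustration, and dropping sentences that contain the `|` delimiter is an extra precaution the post does not mention.

```python
from typing import Iterable, Iterator, List, Tuple

MIN_CHARS = 10      # drop anything shorter than this
MAX_SAMPLES = 1500  # stop once this many samples are collected


def split_sentences(lines: Iterable[str], min_chars: int = MIN_CHARS) -> Iterator[str]:
    """Split each line on full stops and yield the sentences worth keeping."""
    for line in lines:
        for part in line.split("."):
            sentence = part.strip()
            # Skip short fragments and anything that would break the "|" format.
            if len(sentence) >= min_chars and "|" not in sentence:
                yield sentence + "."


def build_metadata(lines: Iterable[str], max_samples: int = MAX_SAMPLES) -> List[Tuple[str, str]]:
    """Return (file_id, transcription) rows, capped at max_samples."""
    rows: List[Tuple[str, str]] = []
    for sentence in split_sentences(lines):
        if len(rows) >= max_samples:
            break
        rows.append((f"{len(rows) + 1:04d}", sentence))
    return rows


def format_metadata(rows: Iterable[Tuple[str, str]]) -> str:
    """Render rows as LJSpeech-style 'id|text' lines."""
    return "".join(f"{file_id}|{text}\n" for file_id, text in rows)
```

Each ID doubles as the name of the wav file that the sentence is synthesised into, so the metadata and the audio stay in sync.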

In hindsight I should have added more sanity checks: some URLs and non-Dutch strings slipped through into the metadata. Oh well!
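For what it's worth, a crude check along these lines would probably have caught most of them. This is a hypothetical sketch, not something that was part of the pipeline: the stopword list is a small hand-picked set, and "contains at least one common Dutch function word" is only a rough stand-in for real language detection.

```python
import re

# Matches obvious URLs; deliberately simple.
_URL_RE = re.compile(r"https?://|www\.", re.IGNORECASE)
# Runs of letters only (no digits or underscores), Unicode-aware.
_LETTERS_RE = re.compile(r"[^\W\d_]+")

# Hand-picked Dutch function words that rarely appear in English text.
DUTCH_STOPWORDS = frozenset(
    {"de", "het", "een", "en", "ik", "dat", "niet", "je",
     "met", "voor", "zijn", "maar", "ook", "naar"}
)


def looks_clean(sentence: str) -> bool:
    """Return True if the sentence has no URL and looks vaguely Dutch."""
    if _URL_RE.search(sentence):
        return False
    words = _LETTERS_RE.findall(sentence.lower())
    if not words:
        return False
    return any(word in DUTCH_STOPWORDS for word in words)
```

Short Dutch sentences without any of these words get rejected too, which is an acceptable loss when the corpus is this large.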

Training

Training ran on an RTX 3090 under WSL. I had to drop to 16-bit precision because 32-bit was consuming more than 24GB of VRAM. After about 70 epochs it started sounding like coherent Dutch. It worked, but the quality left something to be desired.

Then Acapela Group shut down the API endpoint. Back to square one.

Version 2: Mind Express and a VM Contraption

With the Android API gone, finding a new source of Femke audio took me a while. Acapela Group's SDKs were a dead end.

Eventually the answer turned out to be Mind Express, a communication aid for people with speech impairments. Their 30-day trial includes Acapela Group voices, including all the Dutch ones. Perfect!

Getting audio out of it in bulk was the fun part. After attempting to reverse-engineer the Acapela Group DLLs, I came up with the following solution.

The setup

I spun up a Windows 10 VM on Proxmox, gave it a virtual soundcard and a passwordless VNC server:

args: -vnc 0.0.0.0:77
audio0: device=intel-hda,driver=none

Inside the VM, Voicemeeter Banana creates a virtual audio device that Python can record from. Control of Mind Express itself went through vncdotool, because pyautogui flat-out refused to send inputs to it.
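As an illustration of the VNC side, here is a small sketch. The helper names and the Play button coordinates are hypothetical, and it assumes the text field in Mind Express already has focus. The client is duck-typed: it only needs the keyPress, mouseMove and mousePress methods that vncdotool's client objects provide.

```python
from typing import Protocol, Tuple


class VncClient(Protocol):
    """The subset of a vncdotool-style client this sketch relies on."""

    def keyPress(self, key: str) -> object: ...
    def mouseMove(self, x: int, y: int) -> object: ...
    def mousePress(self, button: int) -> object: ...


def vnc_port(display: int) -> int:
    """VNC display N listens on TCP port 5900 + N (so :77 is port 5977)."""
    return 5900 + display


def type_and_play(client: VncClient, text: str, play_xy: Tuple[int, int]) -> None:
    """Type the text one key at a time, then left-click the Play button."""
    for char in text:
        client.keyPress(char)
    x, y = play_xy
    client.mouseMove(x, y)
    client.mousePress(1)  # button 1 is the left mouse button
```

With vncdotool installed, a real client would come from something like `vncdotool.api.connect("localhost::5977")`; the host name here is an assumption.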

I then wrote a small Flask webserver running inside the VM that, on receiving a POST request with text:

Types the text into Mind Express via VNC
Moves the mouse to click Play
Starts recording from the virtual audio device simultaneously
Stops recording after 3 seconds of silence
Trims with ffmpeg
Returns the wav file to the caller
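The "stop after 3 seconds of silence" step is the only part with real logic in it. A minimal sketch of one way to do it, independent of any audio library: chunks are plain sequences of float samples, and the threshold value and function names are assumptions. It only starts counting silence once something has been heard, so the pause before Mind Express starts speaking does not end the recording early.

```python
import math
from typing import Iterable, List, Sequence


def rms(chunk: Sequence[float]) -> float:
    """Root-mean-square level of one chunk of samples."""
    if not chunk:
        return 0.0
    return math.sqrt(sum(sample * sample for sample in chunk) / len(chunk))


def record_until_silence(
    chunks: Iterable[Sequence[float]],
    chunk_seconds: float,
    silence_seconds: float = 3.0,
    threshold: float = 0.01,  # hypothetical level below which a chunk is "silent"
) -> List[Sequence[float]]:
    """Collect chunks until silence_seconds of continuous silence follow speech."""
    recorded: List[Sequence[float]] = []
    silent_for = 0.0
    heard_sound = False
    for chunk in chunks:
        recorded.append(chunk)
        if rms(chunk) >= threshold:
            heard_sound = True
            silent_for = 0.0  # any sound resets the silence timer
        elif heard_sound:
            silent_for += chunk_seconds
            if silent_for >= silence_seconds:
                break
    return recorded
```

The trailing silence is still part of the returned audio, which is why a trimming pass afterwards makes sense.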

So from the outside it’s just a simple API: POST some text, get a wav back. Not fast, since the audio is generated in real time, but fast enough.
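From the caller's side, using it could look something like this. The `/say` path, the `text` form field and the base URL are all assumptions for the sake of the example, since the real route names are not given here.

```python
import urllib.parse
import urllib.request


def build_tts_request(base_url: str, text: str) -> urllib.request.Request:
    """Build a form-encoded POST request carrying the text to speak."""
    data = urllib.parse.urlencode({"text": text}).encode("utf-8")
    url = f"{base_url.rstrip('/')}/say"  # hypothetical endpoint path
    return urllib.request.Request(url, data=data, method="POST")


def fetch_wav(base_url: str, text: str, timeout: float = 120.0) -> bytes:
    """POST the text and return the raw wav bytes from the response."""
    request = build_tts_request(base_url, text)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()
```

The generous timeout is deliberate: the server has to play the whole sentence in real time and then wait out the silence before it can answer.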

The dataset

I used the same CC-100 corpus and filtering approach as v1. I let the generator run until I had 6111 samples and called it there. For reference, Piper’s own recording studio uses 1150 prompts for Dutch, and the Nathalie model appears to be trained on around 1130. So 6111 is comfortably more than enough!

Training

This time, instead of training from scratch, I fine-tuned from Piper’s existing Dutch checkpoint, Nathalie medium, following their now-existing training documentation. With a batch size of 42 at 32-bit precision, training ran for 6000 steps on the RTX 3090 and finished in about 4 hours.
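To put those numbers in perspective, here is a back-of-the-envelope calculation. It assumes that one step processes one batch and that all 6111 samples are used for training (no validation split), neither of which is confirmed above, so treat the result as a rough estimate.

```python
import math
from typing import Dict


def training_summary(samples: int, batch_size: int, steps: int, hours: float) -> Dict[str, float]:
    """Estimate steps per epoch, epochs covered, and time per step."""
    # The last, partially filled batch still counts as a step.
    steps_per_epoch = math.ceil(samples / batch_size)
    return {
        "steps_per_epoch": steps_per_epoch,
        "epochs": steps / steps_per_epoch,
        "seconds_per_step": hours * 3600 / steps,
    }


summary = training_summary(samples=6111, batch_size=42, steps=6000, hours=4)
print(summary)
```

Under those assumptions that works out to 146 steps per epoch, roughly 41 passes over the dataset, and about 2.4 seconds per step.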

The difference compared to v1 is night and day.

[Audio samples: Version 1, Version 2]

What’s Next

Now that the pipeline is working, there are a few directions to take it:

Training more voices: not just Acapela Group, but members contributing their own voice for a personalised !say command
Training on well-known Dutch public figures (gather audio, transcribe with Whisper, fine-tune)
Starting an in-space AI band with Femke and Daan doing Dutch cover songs

The NURDspace wiki has more technical detail on both versions: Training Femke (Voice)