Building Dutch State-of-the-Art TTS for (Almost) Free

2025-09-21 · 4 min read

Parkiet

I've been curious about TPU clusters and whether it would be possible to create a state-of-the-art Dutch text-to-speech model without breaking the bank. Spoiler alert: it absolutely is possible, and I've documented the entire journey in the Parkiet GitHub repository.

Audio Samples Comparison

Conversation

Parkiet:

ElevenLabs:

Prompt: "[S1] denk je dat je een open source model kan trainen met weinig geld en middelen? [S2] ja ik denk het wel. [S1] oh ja, hoe dan? [S2] nou kijk maar in de repo op Git Hub of Hugging Face."

Stuttering

Parkiet:

ElevenLabs:

Prompt: "h h et is dus ook mogelijk, om eh ... uhm, heel veel t te st stotteren in een prompt."

Multi-Speaker

Parkiet:

ElevenLabs:

Prompt: "[S1] hoeveel stemmen worden er ondersteund? [S2] nou, uhm, ik denk toch wel meer dan twee. [S3] ja, ja, d dat is het mooie aan dit model. [S4] ja klopt, het ondersteund tot vier verschillende stemmen per prompt."

Laughs

Parkiet:

ElevenLabs:

Prompt: "(laughs) luister, ik heb een mop, wat uhm, drinkt een webdesigner het liefst? [S2] nou ... ? [S1] Earl Grey (laughs) . [S2] (laughs) heel goed."

Voice Clone

Parkiet:

ElevenLabs:

Prompt: "[S1] je hebt maar weining audio nodig om een stem te clonen." "[S1] dit is wel angstaanjagend dat het zo goed werkt"

Original voice sample used for cloning:

ElevenLabs wins on general conversation quality, but Parkiet is surprisingly competitive on multi-speaker naturalness, considering it was built by one person with limited resources. On stuttering, both models perform equally well, and voice-cloning performance is also on par.
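The prompts above use a simple speaker-tag convention: `[S1]` through `[S4]` mark turns, and cues like `(laughs)` are inlined. As a minimal sketch (a hypothetical helper, not part of the actual Parkiet code), parsing such a prompt into speaker turns might look like:

```python
import re

# Hypothetical helper illustrating the [S1]..[S4] speaker-tag prompt
# format used in the samples above; not taken from the Parkiet repo.
def split_turns(prompt: str):
    """Split a tagged prompt into (speaker, text) turns."""
    # re.split with a capturing group yields:
    # ['', 'S1', ' text ', 'S2', ' text ', ...]
    parts = re.split(r"\[(S[1-4])\]", prompt)
    return [(tag, text.strip())
            for tag, text in zip(parts[1::2], parts[2::2])]

turns = split_turns("[S1] hoeveel stemmen? [S2] meer dan twee.")
print(turns)  # [('S1', 'hoeveel stemmen?'), ('S2', 'meer dan twee.')]
```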

All source code can be found in the Parkiet repository, a comprehensive guide that shows you how to build high-quality TTS models for any language using Google Cloud's free TPU research credits with JAX. From data preparation pipelines to model training, everything is covered.

Why Dutch TTS?

Most open-source TTS models released so far support only English or Chinese. While commercial solutions exist for Dutch, the open-source options lack the naturalness and quality you'd expect from modern neural speech synthesis. I wanted to see if that could change using the latest techniques and Google's generous research credits.

The TPU Advantage

TPUs (Tensor Processing Units) are Google's custom chips designed specifically for machine-learning workloads. They excel at the matrix operations that power neural networks, especially when training large models. Google is generous enough to offer research credits that make this kind of experimentation accessible. Besides the TPUs, I also used my NVIDIA RTX 5090 GPU for the data-processing model (whisper-large-v3) and for prototyping the JAX model.
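The kind of work TPUs accelerate is plain fused matrix arithmetic, which JAX compiles through XLA. A toy sketch (the layer shapes here are illustrative, not Parkiet's actual architecture) of the pattern that runs identically on CPU, GPU, or TPU:

```python
import jax
import jax.numpy as jnp

# Toy dense layer: matmul + bias + ReLU is exactly the fused matrix
# arithmetic that TPUs (via the XLA compiler JAX targets) excel at.
@jax.jit
def dense_layer(x, w, b):
    return jax.nn.relu(x @ w + b)

key = jax.random.PRNGKey(0)
x = jax.random.normal(key, (8, 16))   # batch of 8 feature vectors
w = jax.random.normal(key, (16, 4))   # weights (shapes are illustrative)
b = jnp.zeros(4)
y = dense_layer(x, w, b)
print(y.shape)  # (8, 4)
```

Because the function is `jit`-compiled, the same code needs no changes to move from an RTX 5090 during prototyping to a TPU pod for the real run.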

What You'll Find in the Guide

The Parkiet repository contains everything you need:

  • Complete setup instructions for Google Cloud TPU clusters, plus a walkthrough of porting an existing model to JAX
  • Data preparation pipelines for any language
  • Training scripts optimized for TPU architectures

The guide walks you through the entire process, from data preparation to training. It's designed to be language-agnostic, so while I focused on Dutch, the same techniques work for any language with sufficient training data.

Results

The final Dutch TTS model achieves near-human quality while being open source. Training took about 2 days on a TPU v4-32 pod, costing roughly $300 in compute credits. It can be much cheaper if you don't transfer any data to Google Cloud until you know which region your TPU will be assigned to ;).
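A quick back-of-envelope check on that figure (assuming the $300 covers only the 2-day compute run, with no storage or data-transfer costs):

```python
# Back-of-envelope check on the quoted training cost. Assumes the $300
# in credits covers the full ~2-day run and nothing else.
hours = 2 * 24            # ~2 days of training
total_usd = 300           # quoted credit spend
usd_per_hour = total_usd / hours
print(usd_per_hour)       # 6.25 USD/hour for the whole v4-32 pod
```

At roughly $6/hour for an entire v4-32 pod, the research credits are what make a run like this feasible for an individual.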

Future Work

I'm GPU poor with tons of ideas. If you have too much compute and don't know what to do with it, hit me up!

Peter Evers

Barely updates blog posts but likes a lot of random things