Big labs think voice is “solved.” That's exactly the opening.
I'd just read a piece about Kyutai / Gradium — four researchers who shipped a real-time conversational voice model before OpenAI and a year before xAI's demo. The thesis stuck with me: audio models are absurdly data-efficient compared to text, training them is ~50× cheaper than frontier LLMs, and renting a GPU is “a few clicks.”
So the obvious question on a Saturday: could I make a decent Polish text-to-speech voice myself? Not train from scratch — that is a lab job — but fine-tune an open model. Turns out the answer is “yes, the plumbing works,” with an asterisk I'll get to.
Free Polish audio is everywhere, if you know where to look.
Text-to-speech doesn't need the whole internet. A good voice needs hours, not trillions of tokens, and Polish has plenty of openly licensed speech sitting on Hugging Face:
- Multilingual LibriSpeech (pl) — ~100 h of read audiobooks, CC BY.
- FLEURS (pl) — small, clean, great as a test set.
- plus Common Voice, VoxPopuli, M-AILABS… ~250–300 h total, commercially usable.
I wrote a tiny importer that streams a dataset, resamples everything to 24 kHz mono, loudness-normalizes it, and drops it into a standard wav | text manifest. Pulled ~5 hours to start.
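The per-clip steps can be sketched roughly like this. It is a minimal, standard-library-only illustration, not the actual importer: the function names are made up, the resampler is naive linear interpolation with no anti-aliasing filter, plain RMS scaling to an assumed -23 dBFS stands in for whatever loudness normalization the real script uses, and the `path|text` delimiter is a guess at the manifest layout. Samples are assumed to be floats in [-1, 1].

```python
import math

TARGET_SR = 24_000  # sample rate named in the post


def to_mono(frames):
    """Average the channels of each frame (a frame is a tuple of floats)."""
    return [sum(frame) / len(frame) for frame in frames]


def resample_linear(samples, src_sr, dst_sr=TARGET_SR):
    """Naive linear-interpolation resampler (assumption: good enough for a sketch)."""
    if src_sr == dst_sr or len(samples) < 2:
        return list(samples)
    n_out = round(len(samples) * dst_sr / src_sr)
    last = len(samples) - 1
    out = []
    for i in range(n_out):
        pos = i * src_sr / dst_sr          # fractional position in the source
        lo = min(int(pos), last)
        hi = min(lo + 1, last)
        frac = pos - lo
        out.append(samples[lo] * (1.0 - frac) + samples[hi] * frac)
    return out


def rms(samples):
    """Root-mean-square level of a non-empty list of samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def normalize_rms(samples, target_dbfs=-23.0):
    """Scale to a target RMS level (hypothetical default), then clip to [-1, 1]."""
    if not samples:
        return []
    level = rms(samples)
    if level == 0.0:
        return list(samples)               # pure silence: nothing to scale
    gain = 10 ** (target_dbfs / 20) / level
    return [max(-1.0, min(1.0, s * gain)) for s in samples]


def manifest_line(wav_path, text):
    """One manifest row; the '|' delimiter is an assumption about the format."""
    return f"{wav_path}|{text.strip()}"
```

A real importer would more likely lean on an audio library with a proper band-limited resampler; the point here is only the order of operations: mono, resample, normalize, write the manifest row.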
Numbers and dates are where “general” TTS sounds dumb.
If you feed a model “15:30” or “2026 roku” it'll choke. Polish is heavily inflected, so I wrote a normalizer that expands them into properly spoken words before the model ever sees them:
"Spotkanie o 15:30" → "Spotkanie o piętnaście trzydzieści" "w 2026 roku" → "w dwa tysiące dwudziesty szósty roku" "tel. 123 456 789" → "telefon jeden dwa trzy cztery…" # digit-by-digit "wzrost o 25%" → "wzrost o dwadzieścia pięć procent"
My GPU was busy hosting a 35-billion-parameter chatbot.
The 3090 wasn't idle — it was running a 35B llama-server eating 19 GB of its 24. So the training script does something I'm weirdly proud of: it stops the chatbot to borrow the GPU, trains, and a trap guarantees the chatbot comes back — even if training crashes, gets killed, or the power of friendship fails.
    # bring the chatbot back no matter how we exit
    trap 'sudo systemctl start rocky-llama.service' EXIT INT TERM
    sudo systemctl stop rocky-llama.service   # free 19 GB
    python train/finetune_xtts.py             # borrow the card
    # …trap fires here → chatbot restored, GPU handed back
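The same stop, train, always-restart pattern can also be written in Python as a context manager. This is a sketch of an alternative, not the script described above: `borrowed_gpu` is a hypothetical name, and the `run` parameter exists only so the `systemctl` calls can be swapped out for testing.

```python
import contextlib
import subprocess

SERVICE = "rocky-llama.service"  # unit name taken from the shell snippet


@contextlib.contextmanager
def borrowed_gpu(service=SERVICE, run=subprocess.run):
    """Stop the service, hand over the GPU, and always try to start it again."""
    # If stopping fails, raise before training ever starts.
    run(["sudo", "systemctl", "stop", service], check=True)
    try:
        yield
    finally:
        # Runs on normal exit, on exceptions, and on Ctrl-C (KeyboardInterrupt).
        run(["sudo", "systemctl", "start", service], check=False)


# Hypothetical usage:
#
#     with borrowed_gpu():
#         subprocess.run(["python", "train/finetune_xtts.py"], check=True)
```

One caveat: by default Python exits on SIGTERM without running `finally` blocks, so covering that case would need a signal handler that raises an exception. Neither this sketch nor a shell trap can do anything about SIGKILL or a power cut.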
It worked perfectly every single time — survived 47 restart attempts during one run and still came back serving on the right port. The safety net was the easy part. The model was not.
Then the loss went NaN. And stayed NaN. For three runs.
Training launched, the GPU lit up, steps flew by — and every single loss value was nan. The model was learning precisely nothing.
    STEP: 0/693   | > loss: nan (nan)
    STEP: 50/693  | > loss: nan (nan)
    STEP: 100/693 | > loss: nan (nan)
    # cool. cool cool cool.
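A log like that is a good argument for failing fast. A tiny guard, sketched here with hypothetical names, stops a run at the first non-finite loss instead of burning hundreds of steps on nothing. Whether the trainer used here exposes a hook to call it from is an assumption I have not verified.

```python
import math


class NonFiniteLossError(RuntimeError):
    """Raised when a loss value is NaN or infinite."""


def require_finite(step, loss):
    """Return the loss unchanged, or abort the run if it is NaN/inf."""
    if not math.isfinite(loss):
        raise NonFiniteLossError(f"loss is {loss} at step {step}; aborting")
    return loss
```

Called once per step on the scalar loss (for a tensor, on its Python float value), it turns a silent three-run mystery into an immediate stack trace at step 0.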
The road there was a tour of 2026 dependency archaeology:
- transformers 5.x had quietly deleted a function the TTS library still imported → pin back to 4.57.6.
- The TTS library had moved half its classes between releases → chase the imports.
- A stray rsync flattened a path and I was running a stale copy of my own training script for two whole runs.
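The stale-copy problem in the last bullet is cheap to guard against: hash the script on both machines and compare before launching. A minimal sketch, with hypothetical function names:

```python
import hashlib
from pathlib import Path


def digest(data: bytes) -> str:
    """SHA-256 hex digest of raw bytes."""
    return hashlib.sha256(data).hexdigest()


def file_digest(path) -> str:
    """SHA-256 hex digest of a file's contents."""
    return digest(Path(path).read_bytes())


def is_stale(local_path, deployed_path) -> bool:
    """True when the deployed copy differs from the local source of truth."""
    return file_digest(local_path) != file_digest(deployed_path)
```

When the two copies live on different machines, `is_stale` does not apply directly; printing `file_digest(...)` at the top of the training script and comparing it with the local value does the same job.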
The same forward pass gave a perfectly finite loss on CPU but NaN on GPU. That mismatch was the clue: it wasn't the data — it was fp16. XTTS's GPT decoder overflows in half-precision. Flip training to fp32 and the NaN evaporates.
    Mixed precision: False
    Precision: float32
    STEP: 50/156 | > loss: 0.71
    EPOCH 6/11   | > loss: 0.68   # a real number! it lives!
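For intuition on why half precision can do this: fp16 tops out at 65504, anything larger becomes infinity, and infinities turn into NaN as soon as they meet in a subtraction or a 0 × inf. The sketch below shows that general mechanism using only the standard library (the `struct` module's `"e"` format is IEEE 754 half precision). It is not a trace of what actually happens inside XTTS, and the helper names are made up.

```python
import math
import struct

FP16_MAX = 65504.0  # largest finite IEEE 754 half-precision value


def fits_in_fp16(x):
    """True if x can be packed as a half-precision float without overflowing."""
    try:
        struct.pack("<e", x)
    except OverflowError:
        return False
    return True


def saturate_fp16(x):
    """Mimic the overflow behaviour: out-of-range values become +/- infinity."""
    return x if fits_in_fp16(x) else math.copysign(math.inf, x)


activation = saturate_fp16(70000.0)   # too big for fp16, becomes inf
difference = activation - activation  # inf - inf is NaN, and NaN spreads
```

In fp32 the same value is nowhere near the limit (roughly 3.4e38), which fits the observation that the NaN disappeared once training ran in full precision.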
518 million parameters, eleven epochs, one finite loss.
In fp32 it trained clean — loss falling epoch over epoch — and dropped a 5.3 GB checkpoint. Total wall-clock for the run: a few minutes on the 3090. Here's the very first thing it ever said:
It speaks Polish. It's intelligible. I was thrilled… until I checked whether it was actually any good.
Did the fine-tune beat the model I started from? Listen.
The real test (per the Kyutai crew's own advice): trust your ears over metrics — but back it with a number. I generated the same sentences from my fine-tune and from stock XTTS, then re-transcribed both with Whisper and measured word error rate.
| Sentence | My fine-tune | Stock XTTS |
|---|---|---|
| greeting | 0.00 | 0.00 |
| question | 0.00 | 0.00 |
| numbers / date | 0.53 | 0.53 |
| long | 0.11 | 0.00 |
Word error rate (lower = better). The 0.53 on numbers is a measurement artifact: Whisper writes “2026” as digits, while my reference spells it out in words.
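For reference, word error rate is the word-level edit distance (substitutions + deletions + insertions) divided by the number of words in the reference. Here is a minimal sketch of that formula; the lowercasing and whitespace tokenization are my own assumptions, not necessarily what the evaluation above used.

```python
def wer(reference, hypothesis):
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the ref words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)
```

It also shows how the digit artifact arises: scoring a transcript that keeps a year as one numeral against a reference that spells it out in several words counts every spelled-out word as an error, even when the audio is perfect. Running both sides through the same text normalizer before scoring would remove that.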
My fine-tune is basically tied with stock XTTS — and slightly worse on the long sentence. Training on 0.7 h of many different speakers didn't teach it a voice; it just smeared the strong base model a little. This is a successful pipeline, not a successful voice.
The lesson is the fun part.
- Modern TTS is shockingly accessible. Open weights + free data + one consumer GPU got me a working Polish voice pipeline in a weekend, for the cost of electricity.
- Fine-tuning isn't free magic. A strong base model is hard to beat with scraps. To actually improve on it you need one consistent speaker and a few clean hours — not a multi-speaker soup.
- The bugs are 90% of the work, and 100% of the story: fp16 NaNs, deleted library functions, a self-inflicted rsync path mix-up. Same as it ever was.
- Build the safety net first. Best decision all weekend was the trap that always handed the GPU back to the chatbot.
Before training smarter, I'm going deep on the theory: how XTTS actually works (audio as tokens, neural codecs, and why voice identity lives in the reference, not the weights). Read the explainer →