Day 01 · Build log

I trained a Polish voice for fun, on my own server.

No lab, no budget, no team. Just a spare RTX 3090, a pile of open-source Polish audio, and a weekend. Here's the honest play-by-play — including the bug that produced NaN for three runs straight, and the slightly humbling result at the end.

MODEL · XTTS-v2 fine-tune
LANG · Polish 🇵🇱
COMPUTE · 1× RTX 3090
COST · ~a few cents of power

00 · The spark

Big labs think voice is “solved.” That's exactly the opening.

I'd just read a piece about Kyutai / Gradium — four researchers who shipped a real-time conversational voice model before OpenAI and a year before xAI's demo. The thesis stuck with me: audio models are absurdly data-efficient compared to text, training them is ~50× cheaper than frontier LLMs, and renting a GPU is “a few clicks.”

So the obvious question on a Saturday: could I make a decent Polish text-to-speech voice myself? Not train from scratch — that is a lab job — but fine-tune an open model. Turns out the answer is “yes, the plumbing works,” with an asterisk I'll get to.

01 · The ingredients

Free Polish audio is everywhere, if you know where to look.

Text-to-speech doesn't need the whole internet. A good voice needs hours, not trillions of tokens — and Polish has plenty of openly-licensed speech sitting on Hugging Face:

  • Multilingual LibriSpeech (pl) — ~100 h of read audiobooks, CC BY.
  • FLEURS (pl) — small, clean, great as a test set.
  • plus Common Voice, VoxPopuli, M-AILABS… ~250–300 h total, commercially usable.

I wrote a tiny importer that streams a dataset, resamples everything to 24 kHz mono, loudness-normalizes it, and drops it into a standard wav | text manifest. Pulled ~5 hours to start.
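
For the curious, here is a minimal sketch of the per-clip steps described above (mono, 24 kHz, level normalization, one manifest row). It is not the importer from this build. The function names, the -23 dBFS target, and the exact `wav|text` row format are assumptions, the resampler is plain linear interpolation, and RMS scaling stands in for real loudness normalization.

```python
import math

TARGET_SR = 24_000  # Hz, the rate named in the post


def to_mono(frames):
    """Average the channels of each frame, e.g. [(left, right), ...] -> [mono, ...]."""
    return [sum(frame) / len(frame) for frame in frames]


def resample_linear(samples, src_sr, dst_sr=TARGET_SR):
    """Resample by linear interpolation (stand-in: no anti-aliasing filter)."""
    if src_sr == dst_sr or not samples:
        return list(samples)
    n_out = max(1, round(len(samples) * dst_sr / src_sr))
    last = len(samples) - 1
    out = []
    for i in range(n_out):
        pos = i * src_sr / dst_sr              # position in the source signal
        lo = min(int(pos), last)
        hi = min(lo + 1, last)
        frac = pos - lo
        out.append(samples[lo] * (1.0 - frac) + samples[hi] * frac)
    return out


def normalize_rms(samples, target_dbfs=-23.0):
    """Scale to a target RMS level (assumed value), capping the gain to avoid clipping."""
    if not samples:
        return []
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    if rms == 0.0:
        return list(samples)                   # silence: nothing to scale
    gain = 10 ** (target_dbfs / 20) / rms
    peak = max(abs(s) for s in samples)
    gain = min(gain, 1.0 / peak)               # keep peaks within [-1, 1]
    return [s * gain for s in samples]


def manifest_line(wav_path, text):
    """One 'wav|text' row; whitespace in the transcript is collapsed."""
    return f"{wav_path}|{' '.join(text.split())}"


# Toy input: one second of stereo audio at 16 kHz.
stereo_16k = [(0.2, 0.0), (0.0, 0.2)] * 8_000
clip = normalize_rms(resample_linear(to_mono(stereo_16k), src_sr=16_000))
row = manifest_line("clips/0001.wav", "Dzień  dobry.")
```

A real pipeline would more likely lean on an audio library for the resampling and a proper loudness meter, but the order of operations would stay the same.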

5 h · audio pulled
1459 · clips imported
329 · after filtering
$0 · data cost

02 · The Polish problem

Numbers and dates are where “general” TTS sounds dumb.

If you feed a model “15:30” or “2026 roku” it'll choke. Polish is heavily inflected, so I wrote a normalizer that expands them into properly spoken words before the model ever sees them:

normalize_pl.py
"Spotkanie o 15:30"   → "Spotkanie o piętnastej trzydzieści"
"w 2026 roku"         → "w dwa tysiące dwudziestym szóstym roku"
"tel. 123 456 789"    → "telefon jeden dwa trzy cztery…"  # digit-by-digit
"wzrost o 25%"        → "wzrost o dwadzieścia pięć procent"

03 · The server trick

My GPU was busy hosting a 35-billion-parameter chatbot.

The 3090 wasn't idle — it was running a 35B llama-server eating 19 GB of its 24. So the training script does something I'm weirdly proud of: it stops the chatbot to borrow the GPU, trains, and a trap guarantees the chatbot comes back — even if training crashes, gets killed, or the power of friendship fails.

run_on_simp.sh
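#!/usr/bin/env bash
# bring the chatbot back no matter how we exit
restore() { sudo systemctl start rocky-llama.service; }
trap restore EXIT      # fires on every exit path, including the two below
trap 'exit 130' INT    # Ctrl-C: exit, so the EXIT trap runs restore exactly once
trap 'exit 143' TERM   # kill: same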

sudo systemctl stop rocky-llama.service   # free 19 GB
python train/finetune_xtts.py             # borrow the card
# …trap fires here → chatbot restored, GPU handed back

It worked perfectly every single time — survived 47 restart attempts during one run and still came back serving on the right port. The safety net was the easy part. The model was not.
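
The same stop, borrow, restore pattern can be sketched in Python with a context manager. This is an illustration of the idea, not code from this build: `borrowed_gpu` and its `run` parameter are hypothetical names, and the `run` hook exists only so the behavior can be checked without touching systemd.

```python
import subprocess
from contextlib import contextmanager


@contextmanager
def borrowed_gpu(service, run=subprocess.run):
    """Stop `service`, hand control to the caller, then start the service again."""
    run(["sudo", "systemctl", "stop", service], check=True)
    try:
        yield
    finally:
        # Reached on normal exit, on exceptions, and on Ctrl-C (KeyboardInterrupt).
        run(["sudo", "systemctl", "start", service], check=False)


# Intended use (commented out so the sketch has no side effects):
# with borrowed_gpu("rocky-llama.service"):
#     subprocess.run(["python", "train/finetune_xtts.py"], check=True)
```

Like the shell trap, this cannot survive SIGKILL. Unlike the shell version, a plain SIGTERM ends a Python process without running `finally` unless a signal handler turns it into an exception first, so the shell wrapper is arguably the sturdier of the two.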

04 · Debugging hell

Then the loss went NaN. And stayed NaN. For three runs.

Training launched, the GPU lit up, steps flew by — and every single loss value was nan. The model was learning precisely nothing.

train.log
STEP: 0/693   | > loss: nan  (nan)
STEP: 50/693  | > loss: nan  (nan)
STEP: 100/693 | > loss: nan  (nan)   # cool. cool cool cool.

The road there was a tour of 2026 dependency archaeology:

  • transformers 5.x had quietly deleted a function the TTS library still imported → pin back to 4.57.6.
  • The TTS library had moved half its classes between releases → chase the imports.
  • A stray rsync flattened a path and I was running a stale copy of my own training script for two whole runs.
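
Two of those three could have been caught before the GPU ever spun up. Below is a hedged sketch of a preflight check, assuming both copies of the training script are readable from where the check runs. `preflight` and its arguments are hypothetical names, not part of the build described here.

```python
import hashlib
from pathlib import Path


def major_version(version):
    """'4.57.6' -> 4"""
    return int(version.split(".")[0])


def sha256_of(path):
    """Hex digest of a file's contents."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def preflight(transformers_version, local_script, deployed_script):
    """Return a list of problems; an empty list means it is safe to launch."""
    problems = []
    if major_version(transformers_version) >= 5:
        problems.append(
            f"transformers {transformers_version} is 5.x or newer; pin to 4.x")
    if sha256_of(local_script) != sha256_of(deployed_script):
        problems.append(
            f"{deployed_script} differs from {local_script}: stale copy?")
    return problems
```

The installed version string can be read with `importlib.metadata.version("transformers")` from the standard library, so the check needs no extra dependencies.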

the actual culprit

The same forward pass gave a perfectly finite loss on CPU but NaN on GPU. That mismatch was the clue: it wasn't the data — it was fp16. XTTS's GPT decoder overflows in half-precision. Flip training to fp32 and the NaN evaporates.
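
Why does half precision turn into NaN rather than just a slightly wrong number? The sketch below is a generic illustration using only the standard library, not the actual operation that overflowed inside XTTS (the post does not pin that down). fp16 tops out at 65504, anything larger becomes infinity, and infinity divided by infinity is NaN. The helper names are made up for this example.

```python
import math
import struct


def to_fp16(x):
    """Round x to the nearest IEEE 754 half-precision value; overflow becomes +/-inf."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:                  # struct refuses values beyond the fp16 range
        return math.copysign(math.inf, x)


def naive_softmax(logits, cast=float):
    """Softmax without max-subtraction, rounding every intermediate through `cast`."""
    exps = [cast(math.exp(v)) for v in logits]
    total = cast(sum(exps))
    return [cast(e / total) for e in exps]


logits = [1.0, 12.0]                        # exp(12) is about 162755, above 65504
full = naive_softmax(logits)                # every entry is finite
half = naive_softmax(logits, cast=to_fp16)  # contains a NaN: inf / inf

print(full)
print(half)
```

Once a single NaN enters the loss, every gradient computed from it is NaN as well, which matches the wall of `nan` in the log above.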

train.log · after the fix
Mixed precision: False   Precision: float32
STEP: 50/156   | > loss: 0.71
EPOCH 6/11     | > loss: 0.68   # a real number! it lives!

05 · It lives

518 million parameters, eleven epochs, one finite loss.

In fp32 it trained clean — loss falling epoch over epoch — and dropped a 5.3 GB checkpoint. Total wall-clock for the run: a few minutes on the 3090. Here's the very first thing it ever said:

◢ first words
“Dzień dobry. To jest pierwszy test polskiego głosu, wytrenowany na własnym serwerze.”

It speaks Polish. It's intelligible. I was thrilled… until I checked whether it was actually any good.

06 · The honest scoreboard

Did the fine-tune beat the model I started from? Listen.

The real test (per the Kyutai crew's own advice): trust your ears over metrics — but back it with a number. I generated the same sentences from my fine-tune and from stock XTTS, then re-transcribed both with Whisper and measured word error rate.

[Audio A/B · the long sentence: my fine-tune vs. stock XTTS]
[Audio A/B · the greeting: my fine-tune vs. stock XTTS]

Sentence          My fine-tune   Stock XTTS
greeting          0.00           0.00
question          0.00           0.00
numbers / date    0.53           0.53
long              0.11           0.00

Word error rate (lower = better). The 0.53 on numbers is a measurement artifact: Whisper writes “2026” as digits while my reference spells it out in words, so a correct reading still scores as errors.
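
For reference, word error rate is the word-level edit distance between the reference text and the transcript, divided by the number of reference words. Here is a minimal sketch, assuming whitespace tokenization and lowercasing only; the scoring tool used for the table above is not named in the post, so treat this as an illustration of the metric rather than a reproduction of those numbers.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    prev = list(range(len(hyp) + 1))       # distances against an empty reference
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1] / len(ref)


# Same audio, but the transcript uses digits where the reference uses words:
print(word_error_rate("wzrost o dwadzieścia pięć procent", "wzrost o 25%"))  # 0.6
```

The example shows the artifact in miniature: three of five reference words count as errors even though a listener would hear nothing wrong. Running both sides through the same text normalizer before scoring would remove it.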

verdict

My fine-tune is basically tied with stock XTTS — and slightly worse on the long sentence. Training on 0.7 h of many different speakers didn't teach it a voice; it just smeared the strong base model a little. This is a successful pipeline, not a successful voice.

07 · What I actually learned

The lesson is the fun part.

  • Modern TTS is shockingly accessible. Open weights + free data + one consumer GPU got me a working Polish voice pipeline in a weekend, for the cost of electricity.
  • Fine-tuning isn't free magic. A strong base model is hard to beat with scraps. To actually improve on it you need one consistent speaker and a few clean hours — not a multi-speaker soup.
  • The bugs are 90% of the work — and 100% of the story. fp16 NaNs, deleted library functions, a self-inflicted rsync path. Same as it ever was.
  • Build the safety net first. Best decision all weekend was the trap that always handed the GPU back to the chatbot.
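
The "one consistent speaker" point translates into a simple selection step. A sketch, assuming each clip comes with a speaker ID and a duration (Multilingual LibriSpeech ships speaker IDs); `best_single_speaker` and the 2-hour default threshold are hypothetical choices, not something measured in this build.

```python
from collections import defaultdict


def best_single_speaker(clips, min_hours=2.0):
    """Pick the speaker with the most audio.

    clips: iterable of (speaker_id, duration_seconds, wav_path).
    Returns (speaker_id, paths), or None if nobody reaches min_hours.
    """
    seconds = defaultdict(float)
    paths = defaultdict(list)
    for speaker, duration, path in clips:
        seconds[speaker] += duration
        paths[speaker].append(path)
    if not seconds:
        return None
    top = max(seconds, key=seconds.get)
    if seconds[top] < min_hours * 3600:
        return None                        # not enough audio from any one voice
    return top, paths[top]
```

Returning None instead of a too-small speaker is deliberate: it forces the decision to pull more data before training rather than after.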

next, on Day 02 →

Before training smarter, I go deep on the theory: how XTTS actually works (audio as tokens, neural codecs, and why voice identity lives in the reference, not the weights).