The build, one brick at a time.
How XTTS actually works (audio as language)
Neural codecs, RVQ, semantic vs. acoustic tokens, and why TTS is now just an LLM predicting sound — mapped onto every artifact from Day 01.
I trained a Polish voice for fun
Fine-tuning XTTS-v2 on one RTX 3090 — the GPU trick, the NaN that ate three runs, and an honest A/B that says it tied with the base model. Play it yourself.
The evaluation triangle + the zero-shot ceiling
Build the three numbers that define “good” — WER, speaker-similarity (SECS), naturalness (UTMOS) — then measure base XTTS zero-shot on one clean voice.
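Two of those three numbers are simple enough to sketch by hand. The block below is a minimal illustration, not the harness used in the post: `word_error_rate` and `cosine_similarity` are hypothetical helper names, the text normalization (lowercase, split on whitespace) is an assumption, and the speaker embeddings are assumed to come from some speaker encoder that is not shown. UTMOS is a learned predictor, so it is left out.

```python
import math


def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words.

    Assumption: normalization is just lowercasing and whitespace splitting.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Levenshtein distance over words, keeping only one previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors.

    SECS is this value computed on speaker-encoder embeddings of the
    reference clip and the synthesized clip (encoder not shown here).
    """
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)
```

In use, the hypothesis string would be an ASR transcript of the synthesized audio, so WER measures intelligibility, while the cosine score measures how close the cloned voice sits to the reference speaker.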
The brick ladder.
Each rung is one concept, one experiment, one number — stacked on the last. You can't improve what you can't measure, so it starts with measurement.
- B0 · Eval harness — WER + speaker-similarity + naturalness. The foundation.
- B1 · Zero-shot ceiling — how much voice you get for free from a good reference clip.
- B2 · Single-voice data — curate one consistent speaker (✓ found: 4.3 h, clean).
- B3 · Fine-tune that voice — and learn exactly when it beats zero-shot.
- B4 · Inference knobs — temperature, reference length, multi-reference: free gains.
- B5 · Data-scaling curve — 0.5 / 1 / 2 / 4 h: what each extra hour of data actually buys.
- B6 · Architecture swap — XTTS vs. StyleTTS2 vs. F5 vs. VoxCPM2 on the same voice.
- B7 · The codec deep-dive — semantic vs. acoustic tokens, the quality ceiling.
- B8 · My own voice — record it, clone it, control its prosody.
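For readers who have not met residual vector quantization (RVQ), the idea behind the codec rungs, here is a toy sketch. It illustrates RVQ in general and makes no claim about how XTTS or any specific codec tokenizes audio; the codebooks, the 2-D vectors, and the function names are all made up for illustration. Each stage quantizes whatever the previous stage left over, so a vector becomes a short list of integer tokens, coarse first and fine last.

```python
def nearest(codebook: list[list[float]], vec: list[float]) -> int:
    """Index of the codeword closest to vec (squared Euclidean distance)."""
    return min(
        range(len(codebook)),
        key=lambda k: sum((c - v) ** 2 for c, v in zip(codebook[k], vec)),
    )


def rvq_encode(codebooks: list[list[list[float]]], vec: list[float]) -> list[int]:
    """Return one token per stage; each stage quantizes the remaining residual."""
    residual = list(vec)
    tokens = []
    for codebook in codebooks:
        k = nearest(codebook, residual)
        tokens.append(k)
        # Subtract the chosen codeword; the next stage sees only the error.
        residual = [r - c for r, c in zip(residual, codebook[k])]
    return tokens


def rvq_decode(codebooks: list[list[list[float]]], tokens: list[int]) -> list[float]:
    """Reconstruct by summing the chosen codeword from every stage."""
    out = [0.0] * len(codebooks[0][0])
    for codebook, k in zip(codebooks, tokens):
        out = [o + c for o, c in zip(out, codebook[k])]
    return out


# Hypothetical toy codebooks: a coarse stage and a finer correction stage.
COARSE = [[0.0, 0.0], [1.0, 1.0], [-1.0, -1.0], [1.0, -1.0]]
FINE = [[0.0, 0.0], [0.25, 0.0], [0.0, 0.25], [-0.25, 0.0], [0.0, -0.25]]

tokens = rvq_encode([COARSE, FINE], [1.2, 0.9])
reconstruction = rvq_decode([COARSE, FINE], tokens)
```

Decoding with only the first token gives a rougher reconstruction than decoding with both, which is the sense in which early RVQ stages carry coarse structure and later stages carry detail.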