A New Model Every Day — until ElevenLabs hires me

The log

The build, one brick at a time.

How XTTS actually works (audio as language)

Neural codecs, RVQ, semantic vs. acoustic tokens, and why TTS is now just an LLM predicting sound — mapped onto every artifact from Day 01.

01 THE BUILD

build log

I trained a Polish voice for fun

Fine-tuning XTTS-v2 on one RTX 3090 — the GPU trick, the NaN that ate three runs, and an honest A/B that says it tied with the base model. Play it yourself.

03 NEXT

coming up

The evaluation triangle + the zero-shot ceiling

Build the three numbers that define “good” — WER, speaker-similarity (SECS), naturalness (UTMOS) — then measure base XTTS zero-shot on one clean voice.
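Of the three numbers, WER is the easiest to pin down: it is the word-level edit distance between a reference transcript and an ASR transcript of the synthesized audio, divided by the number of reference words. Here is a minimal sketch, assuming whitespace tokenization and lowercasing as the only normalization. The function name `word_error_rate` is illustrative and not taken from the build itself, and the ASR step that produces the hypothesis text is left out.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    # Assumption: lowercasing + whitespace split is the only text normalization.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the processed reference
    # prefix and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substituted word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on a mat"))
```

A real harness would likely strip punctuation and normalize numbers before scoring, since those choices move WER noticeably, especially for Polish text.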

The roadmap

The brick ladder.

Each rung is one concept, one experiment, one number — stacked on the last. You can't improve what you can't measure, so it starts with measurement.

B0 · Eval harness — WER + speaker-similarity + naturalness. The foundation.
B1 · Zero-shot ceiling — how much voice you get for free from a good reference clip.
B2 · Single-voice data — curate one consistent speaker (✓ found: 4.3 h, clean).
B3 · Fine-tune that voice — and learn exactly when it beats zero-shot.
B4 · Inference knobs — temperature, reference length, multi-reference: free gains.
B5 · Data-scaling curve — 0.5 / 1 / 2 / 4 h: how much each extra hour of data actually buys.
B6 · Architecture swap — XTTS vs. StyleTTS2 vs. F5 vs. VoxCPM2 on the same voice.
B7 · The codec deep-dive — semantic vs. acoustic tokens, the quality ceiling.
B8 · My own voice — record it, clone it, control its prosody.
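The speaker-similarity number in B0 is usually reported as the cosine similarity between two speaker embeddings: one from the reference clip, one from the generated audio. The sketch below shows only that final scoring step, assuming the embeddings already exist as plain lists of floats. Which speaker encoder produces them is not specified here, and the name `speaker_similarity` is hypothetical.

```python
import math
from typing import Sequence


def speaker_similarity(ref_emb: Sequence[float], gen_emb: Sequence[float]) -> float:
    """Cosine similarity between two speaker embeddings, in [-1, 1]."""
    # Assumption: both embeddings come from the same (unspecified) speaker encoder.
    if len(ref_emb) != len(gen_emb):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(ref_emb, gen_emb))
    ref_norm = math.sqrt(sum(x * x for x in ref_emb))
    gen_norm = math.sqrt(sum(y * y for y in gen_emb))
    if ref_norm == 0.0 or gen_norm == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (ref_norm * gen_norm)


if __name__ == "__main__":
    # Toy 3-dimensional vectors; real speaker embeddings have hundreds of dimensions.
    print(speaker_similarity([0.2, 0.9, 0.1], [0.25, 0.85, 0.05]))
```

Scores are only comparable across rungs when the same encoder and the same reference clip are used throughout, so both are worth fixing once in the harness.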