Day 02 · Explainer

How XTTS actually works.

Before training a better voice, understand the machine. Here's the whole stack — audio as tokens, neural codecs, and TTS as language modeling — built up from first principles and mapped onto every artifact from Day 01's real run.

READS ON · VALL-E · EnCodec · Mimi
LEVEL · intuition + the why
01 · The one idea

Treat audio as a sequence of tokens.

Raw audio is ~24,000 numbers per second — far too long and continuous to model directly. The breakthrough: compress audio into a short sequence of discrete tokens (like words), then model speech the way LLMs model text — next-token prediction.

That one move is why scrappy teams can beat big labs at audio: once audio is tokens, you inherit everything from the LLM world — transformers, scaling, in-context learning.

waveform → encoder → quantizer → 🎟 tokens → decoder → waveform
02 · The codec brick

How audio becomes tokens: residual vector quantization.

A neural codec learns to compress each short frame of audio (about 13 ms at EnCodec's 75 frames per second, 80 ms at Mimi's 12.5) into a few discrete codes. It uses Residual Vector Quantization (RVQ): a stack of codebooks where each one quantizes what the previous one missed:

codebook 1 → coarse shape
codebook 2 → + residual
codebook 3 → + finer residual
… codebook N → + detail

So one frame becomes N tokens. One second ≈ 75 frames × 8 codebooks = 600 tokens (EnCodec), or just 12.5 frames × 8 = 100 tokens (Kyutai's Mimi). Fewer frames means a shorter sequence: easier to model, and fewer generation steps per second of audio.
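The residual idea can be sketched in a few lines of NumPy. This is a toy under stated assumptions, not EnCodec's or Mimi's implementation: the codebooks here are random rather than learned, the names `rvq_encode`, `rvq_decode` and `tokens_per_second` are hypothetical, and each codebook includes an all-zero entry so that every stage is guaranteed not to make the reconstruction worse.

```python
import numpy as np

def rvq_encode(frame, codebooks):
    """Quantize one frame into one code index per codebook."""
    residual = np.asarray(frame, dtype=np.float64).copy()
    codes = []
    for codebook in codebooks:
        # Pick the entry nearest to what is still unexplained...
        distances = np.linalg.norm(codebook - residual, axis=1)
        index = int(np.argmin(distances))
        codes.append(index)
        # ...and hand the leftover error to the next codebook.
        residual = residual - codebook[index]
    return codes

def rvq_decode(codes, codebooks, dim):
    """Reconstruct a frame by summing the chosen entries."""
    frame = np.zeros(dim)
    for index, codebook in zip(codes, codebooks):
        frame = frame + codebook[index]
    return frame

def tokens_per_second(frame_rate_hz, n_codebooks):
    return frame_rate_hz * n_codebooks

# Toy setup (assumption): random codebooks whose scale halves at each stage.
rng = np.random.default_rng(0)
dim, n_codebooks, codebook_size = 16, 8, 64
codebooks = [
    np.vstack([np.zeros(dim),
               rng.normal(scale=0.5 ** k, size=(codebook_size - 1, dim))])
    for k in range(n_codebooks)
]
frame = rng.normal(size=dim)
codes = rvq_encode(frame, codebooks)

# Reconstruction error after using the first k codebooks, k = 0..N.
errors = [
    float(np.linalg.norm(frame - rvq_decode(codes[:k], codebooks[:k], dim)))
    for k in range(n_codebooks + 1)
]
print(codes)
print(tokens_per_second(75, 8), tokens_per_second(12.5, 8))
```

The `errors` list never increases from one codebook to the next, which is the "each one quantizes what the previous one missed" behaviour in miniature. A trained codec gets a much steeper drop because its codebooks are fitted to real residuals.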

03 · The deep distinction

Semantic vs. acoustic tokens.

Not all tokens carry the same thing — this is the heart of the audio-AI story:

semantic (HuBERT/WavLM): [content][content][content] ← what is said (phonetics, meaning)
acoustic (EnCodec/SoundStream): [timbre][pitch][style][detail] ← how it sounds (the voice itself)

Old recipe (AudioLM): predict semantic first (the words) → then acoustic (the sound). Mimi's trick: distill the semantic signal into codebook #1 of the codec, so token 1 = meaning and tokens 2–N = sound — all in one streaming pass. One codec, both jobs.

04 · The acoustic model

VALL-E: a language model over codec tokens.

Once audio is tokens, TTS becomes shockingly simple to state:

text: "dzień dobry" (Polish for "good day") + 🎙 3 s of YOUR voice (tokens) ⟶ predict ⟶ a₁ a₂ a₃
(legend: voice prompt · text · predicted audio tokens)

The magic: because the reference voice's tokens sit in the prompt, the model continues in that voice. Zero-shot cloning is just in-context learning — the same trick as few-shot prompting an LLM. No per-voice training required.
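The prompt-and-continue loop can be sketched without any real model. Everything below is a toy assumption: the token encoding (text tokens below 100, audio tokens as 100 + 10 × position + voice id), the `STOP` token, and `toy_next_token`, which stands in for a trained network. The point is only that the same "weights" produce a different voice when the prompt changes.

```python
STOP = -1  # hypothetical end-of-audio token

def generate(next_token, text_tokens, voice_prompt, max_new=50):
    """Autoregressively continue the sequence [text][voice prompt]."""
    context = list(text_tokens) + list(voice_prompt)
    generated = []
    for _ in range(max_new):
        token = next_token(context + generated)
        if token == STOP:
            break
        generated.append(token)
    return generated

def toy_next_token(sequence):
    """Stand-in for the model: copies the voice id found in the context."""
    audio = [t for t in sequence if t >= 100]
    if len(audio) >= 6:          # toy stopping rule
        return STOP
    voice_id = audio[-1] % 10    # the "voice" is read from the prompt
    return 100 + 10 * len(audio) + voice_id

text = [1, 2]                     # toy text tokens
prompt_voice_3 = [103, 113, 123]  # three audio tokens from speaker 3
prompt_voice_7 = [107, 117, 127]  # three audio tokens from speaker 7

out_3 = generate(toy_next_token, text, prompt_voice_3)
out_7 = generate(toy_next_token, text, prompt_voice_7)
print(out_3)  # [133, 143, 153]
print(out_7)  # [137, 147, 157]
```

Nothing about `toy_next_token` changed between the two calls. Only the prefix did, which is the sense in which cloning here is in-context rather than trained.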

05 · XTTS, dissected

Every artifact from Day 01, explained.

XTTS-v2 is one GPT modelling a joint sequence — [ speaker conditioning ][ text tokens ][ audio tokens ] — predicting audio tokens it then decodes to sound:

[ cond ][ cond ][ text ][ audio ][ audio ][ audio ] →

  • text_inputs (224-char limit): Front-end. Multilingual BPE tokenizer + a [pl] language token.
  • cond_mels · get_conditioning_latents(): Speaker brick. Reference clip → mel → conditioning latents (a perceiver resampler) prepended to the GPT. This is the voice.
  • audio_codes · shape (2, 235), ≤ 1012: Codec brick. XTTS's DVAE turns the mel into ~1024-vocab discrete audio tokens.
  • the 518M-param GPT: Acoustic model. A GPT-2-style decoder predicting the next audio token, VALL-E-style.
  • loss_text_ce + loss_mel_ce: It's trained to predict both the text tokens and the audio tokens in one sequence, which is why there were two losses.
  • output_sample_rate = 24000: Vocoder. Audio tokens → a HiFi-GAN-style decoder → 24 kHz waveform.
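To make the two losses concrete, here is a minimal NumPy sketch of cross-entropy computed separately over the text positions and the audio positions of one sequence. It is an illustration, not XTTS's training code: the helper names, the toy vocabulary sizes, and the `text_weight` / `audio_weight` parameters are assumptions, and the real trainer may weight the two terms differently.

```python
import numpy as np

def cross_entropy(logits, targets):
    """Mean next-token cross-entropy. logits: (T, V), targets: (T,)."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(targets)), targets].mean())

def joint_loss(text_logits, text_targets, audio_logits, audio_targets,
               text_weight=1.0, audio_weight=1.0):
    """Two cross-entropies over one sequence, one per segment."""
    loss_text_ce = cross_entropy(text_logits, text_targets)
    loss_mel_ce = cross_entropy(audio_logits, audio_targets)
    total = text_weight * loss_text_ce + audio_weight * loss_mel_ce
    return loss_text_ce, loss_mel_ce, total

# Toy check: an untrained model with uniform logits scores log(vocab size).
text_logits = np.zeros((5, 4))    # 5 text positions, toy vocab of 4
audio_logits = np.zeros((7, 8))   # 7 audio positions, toy vocab of 8
text_targets = np.array([0, 1, 2, 3, 0])
audio_targets = np.array([0, 1, 2, 3, 4, 5, 6])

loss_text_ce, loss_mel_ce, total = joint_loss(
    text_logits, text_targets, audio_logits, audio_targets
)
print(loss_text_ce, loss_mel_ce, total)
```

With uniform logits each term equals the log of its vocabulary size, which gives a handy sanity baseline when reading real training logs: a loss near that value means the model has learned nothing yet for that segment.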
06 · The key insight

The voice lives in the prompt, not the weights.

why my fine-tune tied with base

In VALL-E/XTTS, voice identity is carried mostly by the reference conditioning — the weights just know “Polish, and how to clone any voice from a prompt.” Day 01's fine-tune nudged the weights with a soup of speakers, but at inference the identity still came from the reference clip, which base XTTS already uses perfectly. So it tied.

Fine-tuning only wins when you (a) lock onto one consistent voice so the weights specialize, (b) adapt prosody/style/domain, or (c) fill a language/accent gap. That's the entire reason the roadmap pivots to a single speaker.

07 · The bugs, explained by theory

Now the war stories make sense.

  • fp16 → NaN: the GPT's attention logits get large; fp16 tops out at 65,504, so they overflow to inf, and inf then turns into NaN (for example via inf − inf inside the softmax). fp32 (or bf16, which keeps fp32's exponent range) has the headroom. Precision is about range, not just speed.
  • The ≤11 s clip rule: gpt_max_audio_tokens caps the audio-token sequence; longer audio overruns the position embeddings → garbage. Theory dictates the data rule.
  • The 224-char text limit: the same ceiling, on the text-token side.
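The fp16 failure is easy to reproduce in isolation. The sketch below uses NumPy rather than the training framework, and the logit value 70,000 is an arbitrary assumption chosen only because it exceeds the fp16 maximum; it is not a number taken from the Day 01 run.

```python
import numpy as np

def softmax(logits):
    """Numerically standard softmax: subtract the max before exponentiating."""
    shifted = logits - logits.max()
    exp = np.exp(shifted)
    return exp / exp.sum()

fp16_max = float(np.finfo(np.float16).max)  # 65504.0

logits32 = np.array([70000.0, 1.0, 2.0], dtype=np.float32)

with np.errstate(all="ignore"):  # silence the overflow/invalid warnings
    logits16 = logits32.astype(np.float16)  # 70000 does not fit: becomes inf
    probs16 = softmax(logits16)             # inf - inf = nan poisons the row
    probs32 = softmax(logits32)             # same values, enough range

print(fp16_max)
print(logits16)
print(probs16)
print(probs32)
```

The fp32 row comes out as a valid distribution while the fp16 row is all NaN, even though both went through the same "safe" softmax. The max-subtraction trick cannot help once the input itself is already inf.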
08 · The map ahead

Two families — and which brick to swap.

When XTTS hits a ceiling, you don't start over — you swap one brick:

  • Autoregressive token-LMs (VALL-E, XTTS, Moshi) — flexible, great cloning, in-context; but slower and can repeat/hallucinate.
  • Flow-matching / diffusion (StyleTTS2, F5-TTS, VoxCPM2) — predict the acoustic representation in parallel or via an ODE → fast, stable, often more natural. Some skip discrete tokens entirely (“tokenizer-free”).
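The "via an ODE" part of the second family can be illustrated with a toy Euler solver. This is a sketch under a strong assumption: `oracle_velocity` cheats by knowing the target, whereas a real flow-matching TTS model uses a trained network conditioned on text and speaker to predict that velocity. The function names and shapes are hypothetical.

```python
import numpy as np

def oracle_velocity(x, t, target):
    """Velocity that moves x along a straight path toward target by t = 1.
    Stand-in for a learned network (assumption for this sketch)."""
    return (target - x) / (1.0 - t)

def euler_sample(noise, target, n_steps):
    """Integrate dx/dt = v(x, t) from t = 0 to t = 1 with Euler steps."""
    x = noise.copy()
    dt = 1.0 / n_steps
    for k in range(n_steps):
        t = k * dt
        # Every frame is updated at once: no left-to-right token loop.
        x = x + dt * oracle_velocity(x, t, target)
    return x

rng = np.random.default_rng(0)
n_frames, dim = 75, 4                       # toy "one second" of features
target = rng.normal(size=(n_frames, dim))   # what the model should produce
noise = rng.normal(size=(n_frames, dim))    # starting point at t = 0

sample = euler_sample(noise, target, n_steps=8)
print(np.allclose(sample, target))
```

The number of solver steps (8 here) does not depend on how many frames are generated, whereas an autoregressive token-LM needs at least one step per frame. That difference is the main source of the speed gap between the two families.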
next, on Day 03 →

Theory meets numbers: build the evaluation triangle (WER + speaker-similarity + naturalness) and measure base XTTS zero-shot on one clean voice — the “how much do I get for free?” experiment.