Treat audio as a sequence of tokens.
Raw audio at 24 kHz is 24,000 samples per second: far too long a sequence, and continuous-valued, to model directly. The breakthrough: compress audio into a short sequence of discrete tokens (like words), then model speech the way LLMs model text, by next-token prediction.
That one move is a big part of why scrappy teams can compete with big labs at audio: once audio is tokens, you inherit the whole LLM toolbox (transformers, scaling, in-context learning).
How audio becomes tokens: residual vector quantization.
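A neural codec learns to compress each short frame of audio (about 13 ms for EnCodec, 80 ms for Mimi) into a few discrete codes. It uses Residual Vector Quantization (RVQ): a stack of codebooks in which the first quantizes the frame's latent vector and each later one quantizes the residual that the previous stages missed. The frame's tokens are the indices chosen from each codebook.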
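The residual idea can be sketched in a few lines. This is a toy illustration, not EnCodec's or Mimi's implementation: the codebooks are hand-built (four 2-D corner points per stage, halving in size each stage) rather than learned, and `make_codebook`, `rvq_encode` and `rvq_decode` are hypothetical names.

```python
def make_codebook(scale):
    """Toy 2-D codebook (an assumption, not a learned one): four corner points."""
    return [(-scale, -scale), (-scale, scale), (scale, -scale), (scale, scale)]

def nearest(codebook, vec):
    """Index of the codebook entry closest to vec (squared Euclidean distance)."""
    def dist(i):
        return sum((c - v) ** 2 for c, v in zip(codebook[i], vec))
    return min(range(len(codebook)), key=dist)

def rvq_encode(vec, codebooks):
    """Return one code per stage; each stage quantizes what is left over."""
    residual = list(vec)
    codes = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        codes.append(idx)
        # Subtract the chosen entry so the next stage sees only the residual.
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return codes

def rvq_decode(codes, codebooks):
    """Reconstruct the vector by summing the chosen entry from every stage."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(codes, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out

# Four stages whose entries halve in size, so each stage refines the previous one.
codebooks = [make_codebook(0.5 / 2 ** k) for k in range(4)]
frame_latent = (0.3, -0.7)  # stand-in for one frame's encoder output
codes = rvq_encode(frame_latent, codebooks)
approx = rvq_decode(codes, codebooks)
error = max(abs(a - b) for a, b in zip(approx, frame_latent))
print(codes, approx, error)
```

Decoding with only the first two codes gives a coarser reconstruction; adding stages shrinks the error, which is why a codec can trade bitrate for quality by dropping later codebooks.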
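So one frame becomes N tokens, one per codebook. One second ≈ 75 frames × 8 codebooks = 600 tokens (EnCodec), or just 12.5 frames × 8 = 100 tokens (Kyutai's Mimi). Fewer frames per second means shorter sequences and fewer autoregressive steps, so the audio is easier to model and cheaper to generate in real time.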
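The arithmetic behind those numbers, as a quick check. The frame rates and codebook counts are the ones quoted above, and the 24 kHz sample rate is the one used earlier in these notes; `token_budget` is a hypothetical helper.

```python
SAMPLE_RATE = 24_000  # raw samples per second, as quoted earlier in the notes

CODECS = {
    "EnCodec": {"frames_per_s": 75.0, "codebooks": 8},
    "Mimi": {"frames_per_s": 12.5, "codebooks": 8},
}

def token_budget(frames_per_s, codebooks, seconds=1.0):
    """Frames, tokens, and frame size for a given duration of audio."""
    frames = frames_per_s * seconds
    return {
        "frames": frames,
        "tokens": frames * codebooks,
        "frame_ms": 1000.0 / frames_per_s,
        "samples_per_frame": SAMPLE_RATE / frames_per_s,
    }

for name, spec in CODECS.items():
    print(name, token_budget(**spec))
```

Under these assumptions one second of EnCodec audio is 600 tokens (320 samples per frame) and one second of Mimi audio is 100 tokens (1,920 samples per frame), against 24,000 raw samples.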
Semantic vs. acoustic tokens.
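Not all tokens carry the same thing, and this is the heart of the audio-AI story. Semantic tokens capture what is said (the words and linguistic content); acoustic tokens capture how it sounds (speaker identity, timbre, prosody, recording conditions).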
Old recipe (AudioLM): predict semantic first (the words) → then acoustic (the sound). Mimi's trick: distill the semantic signal into codebook #1 of the codec, so token 1 = meaning and tokens 2–N = sound — all in one streaming pass. One codec, both jobs.
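A simplified sketch of the two generation orders. It assumes a frame is just a list of token ids with the semantic token in position 0, which matches the Mimi description above but glosses over AudioLM's separate semantic tokenizer and any per-codebook delays a real model may use. All function names are hypothetical.

```python
def split_streams(frames):
    """Split each frame into its semantic token (codebook 1) and acoustic tokens."""
    semantic = [frame[0] for frame in frames]
    acoustic = [frame[1:] for frame in frames]
    return semantic, acoustic

def two_stage_order(frames):
    """AudioLM-style order (simplified): all semantic tokens, then all acoustic."""
    semantic, acoustic = split_streams(frames)
    return semantic + [tok for frame in acoustic for tok in frame]

def streaming_order(frames):
    """Mimi-style order (simplified): emit each frame whole, one after another."""
    return [tok for frame in frames for tok in frame]

# Three toy frames: ids 10-12 play the semantic role, 1-6 the acoustic role.
frames = [[10, 1, 2], [11, 3, 4], [12, 5, 6]]
print(two_stage_order(frames))   # semantic block first, acoustic block second
print(streaming_order(frames))   # frame by frame
```

In the streaming order the first frame is complete after three tokens, so playback can start immediately; in the two-stage order no frame is complete until the whole semantic pass has finished.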
VALL-E: a language model over codec tokens.
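Once audio is tokens, TTS becomes shockingly simple to state: given the text and a few seconds of a reference voice encoded as codec tokens, predict the codec tokens of the new speech, then decode them to a waveform.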
The magic: because the reference voice's tokens sit in the prompt, the model continues in that voice. Zero-shot cloning is just in-context learning — the same trick as few-shot prompting an LLM. No per-voice training required.
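The prompting pattern can be sketched as follows. This is a minimal illustration of the autoregressive loop only: `toy_next_token` is a placeholder standing in for a trained language model, the token ids are made up, and `EOS`, `build_prompt` and `generate` are hypothetical names, not VALL-E's API.

```python
EOS = -1  # hypothetical end-of-audio marker

def build_prompt(text_tokens, reference_audio_tokens):
    """Put the text and the reference voice's codec tokens in one sequence."""
    return list(text_tokens) + list(reference_audio_tokens)

def generate(next_token, prompt, max_new_tokens):
    """Plain next-token loop: the model continues the audio already in the prompt."""
    sequence = list(prompt)
    generated = []
    for _ in range(max_new_tokens):
        token = next_token(sequence)
        if token == EOS:
            break
        sequence.append(token)
        generated.append(token)
    return generated

def toy_next_token(sequence):
    """Placeholder for a trained LM: count up from the last token, stop at 105."""
    last = sequence[-1]
    return EOS if last >= 105 else last + 1

text_tokens = [7, 8, 9]                   # stand-in for the text to speak
reference_audio_tokens = [100, 101, 102]  # stand-in for the reference clip's codes
prompt = build_prompt(text_tokens, reference_audio_tokens)
new_audio_tokens = generate(toy_next_token, prompt, max_new_tokens=50)
print(new_audio_tokens)
```

Nothing in the loop is voice-specific: swapping `reference_audio_tokens` changes the voice without touching the model, which is the sense in which cloning is in-context learning.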
Every artifact from Day 01, explained.
XTTS-v2 is one GPT modelling a joint sequence — [ speaker conditioning ][ text tokens ][ audio tokens ] — predicting audio tokens it then decodes to sound:
| Day 01 artifact | What it is |
| --- | --- |
| text_inputs (224-char limit) | Front-end: multilingual BPE tokenizer + a [pl] language token. |
| cond_mels · get_conditioning_latents() | Speaker brick: reference clip → mel → conditioning latents (a perceiver resampler) prepended to the GPT. This is the voice. |
| audio_codes · shape (2, 235), values ≤ 1012 | Codec brick: XTTS's DVAE turns the mel spectrogram into discrete audio tokens drawn from a ~1024-entry vocabulary. |
| the 518M-param GPT | Acoustic model: a GPT-2-style decoder predicting the next audio token — VALL-E-style. |
| loss_text_ce + loss_mel_ce | It's trained to predict both the text tokens and the audio tokens in one sequence — that's why there were two losses. |
| output_sample_rate = 24000 | Vocoder: audio tokens → a HiFi-GAN-style decoder → 24 kHz waveform. |
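The two losses in the table can be sketched as two cross-entropies over different slices of the same sequence. This is a toy version with hand-written probabilities, not the XTTS training code; the weights default to 1.0 purely for illustration, and `cross_entropy` and `joint_loss` are hypothetical names.

```python
import math

def cross_entropy(prob_rows, targets):
    """Mean negative log-likelihood of the target token at each position."""
    total = sum(math.log(row[t]) for row, t in zip(prob_rows, targets))
    return -total / len(targets)

def joint_loss(text_probs, text_targets, audio_probs, audio_targets,
               text_weight=1.0, audio_weight=1.0):
    """Weighted sum of the text-token loss and the audio-token loss."""
    loss_text_ce = cross_entropy(text_probs, text_targets)
    loss_mel_ce = cross_entropy(audio_probs, audio_targets)
    total = text_weight * loss_text_ce + audio_weight * loss_mel_ce
    return loss_text_ce, loss_mel_ce, total

# Toy predictions: uniform over 4 text tokens, uniform over 2 audio tokens.
text_probs = [[0.25, 0.25, 0.25, 0.25]] * 2
text_targets = [0, 3]
audio_probs = [[0.5, 0.5]] * 3
audio_targets = [0, 1, 1]

loss_text_ce, loss_mel_ce, total = joint_loss(
    text_probs, text_targets, audio_probs, audio_targets)
print(loss_text_ce, loss_mel_ce, total)
```

A uniform guess over V tokens costs log V, so the toy text loss is log 4 and the toy audio loss is log 2. In practice the two terms are usually weighted differently, so check the trainer configuration for the actual weights.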
The voice lives in the prompt, not the weights.
In VALL-E/XTTS, voice identity is carried mostly by the reference conditioning; the weights just know “Polish, and how to clone any voice from a prompt.” Day 01's fine-tune nudged the weights with a soup of speakers, but at inference the identity still came from the reference clip, which base XTTS already exploits well. So the fine-tuned model tied with the base model.
Fine-tuning only wins when you (a) lock onto one consistent voice so the weights specialize, (b) adapt prosody/style/domain, or (c) fill a language/accent gap. That's the entire reason the roadmap pivots to a single speaker.
Now the war stories make sense.
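- fp16 → NaN: the GPT's attention logits get large; fp16 maxes out at 65,504, so anything bigger overflows → inf, and the next operation that combines infinities (for example inf − inf inside softmax) → NaN. fp32 (or bf16, which keeps fp32's 8-bit exponent range) has the headroom. The choice of number format is about range, not just speed.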
- The ≤11 s clip rule: gpt_max_audio_tokens caps the audio-token sequence; longer audio overruns the position embeddings → garbage. Theory dictates the data rule.
- The 224-char text limit: the same ceiling, on the text-token side.
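The fp16 overflow in the first bullet can be reproduced without a GPU. This sketch uses Python's `struct` module, whose `e` format is IEEE half precision and refuses values that do not fit. The bf16 emulation truncates a float32 to its top 16 bits, a simplification since real hardware usually rounds. `fits_in_fp16` and `to_bf16` are hypothetical helper names.

```python
import math
import struct

FP16_MAX = 65504.0  # largest finite half-precision value

def fits_in_fp16(value):
    """True if value can be stored as a finite IEEE half-precision float."""
    try:
        struct.pack("<e", value)
        return True
    except OverflowError:
        return False

def to_bf16(value):
    """Emulate bfloat16 by keeping the top 16 bits of the float32 encoding.

    Truncation only (an assumption for simplicity); the sign bit, the 8
    exponent bits and the top 7 mantissa bits survive.
    """
    bits = struct.unpack("<I", struct.pack("<f", value))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]

big_logit = 70000.0  # illustrative value just past the fp16 limit
print(fits_in_fp16(FP16_MAX))   # the limit itself still fits
print(fits_in_fp16(big_logit))  # past the limit: overflow
print(to_bf16(big_logit))       # coarse but finite in bf16
print(math.isnan(float("inf") - float("inf")))  # how inf turns into NaN
```

bf16 keeps the value finite at the cost of precision (only about three significant decimal digits), which is the trade the bullet describes: range first, precision second.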
Two families — and which brick to swap.
When XTTS hits a ceiling, you don't start over — you swap one brick:
- Autoregressive token-LMs (VALL-E, XTTS, Moshi) — flexible, great cloning, in-context; but slower and can repeat/hallucinate.
- Flow-matching / diffusion (StyleTTS2, F5-TTS, VoxCPM2) — predict the acoustic representation in parallel or via an ODE → fast, stable, often more natural. Some skip discrete tokens entirely (“tokenizer-free”).
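To make the speed difference between the two families concrete, here is a toy Euler sampler of the kind flow-matching models use at inference. It is a sketch under strong assumptions: the velocity field is a closed-form stand-in that already knows the target (a trained network would predict it from text and speaker conditioning), the starting point is zeros rather than noise, and `make_toy_velocity` and `euler_sample` are hypothetical names.

```python
def make_toy_velocity(target, call_log):
    """Stand-in for a trained velocity network (assumption: target is known).

    Returns the straight-line velocity that carries x to target by t = 1,
    and records every call so the number of model evaluations can be counted.
    """
    def velocity(x, t):
        call_log.append(t)
        return [(goal - xi) / (1.0 - t) for goal, xi in zip(target, x)]
    return velocity

def euler_sample(velocity, x0, steps):
    """Integrate dx/dt = velocity(x, t) from t = 0 to t = 1 with Euler steps."""
    x = list(x0)
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt
        v = velocity(x, t)
        # Every frame is updated in the same step: the work is parallel in time.
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

num_frames = 200                                   # length of the utterance
target = [0.01 * i for i in range(num_frames)]     # toy acoustic trajectory
calls = []
velocity = make_toy_velocity(target, calls)
sample = euler_sample(velocity, [0.0] * num_frames, steps=8)
print(len(calls), "model calls for", num_frames, "frames")
```

The sampler makes 8 model calls regardless of utterance length, whereas an autoregressive token-LM needs at least one call per generated frame. That fixed step count is where the "fast, stable" claim comes from.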
Theory meets numbers: build the evaluation triangle (WER + speaker-similarity + naturalness) and measure base XTTS zero-shot on one clean voice — the “how much do I get for free?” experiment.
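Two corners of that triangle are easy to sketch. WER is the word-level edit distance between a reference transcript and an ASR transcript of the generated audio, divided by the reference length; speaker similarity is commonly reported as the cosine similarity between speaker embeddings. The embeddings below are made-up vectors, since producing real ones requires a speaker-verification model, and naturalness (typically a listener rating or a learned predictor) is left out. `wer` and `cosine_similarity` are hypothetical helper names.

```python
import math

def wer(reference, hypothesis):
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, 1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,                          # deletion
                cur[j - 1] + 1,                       # insertion
                prev[j - 1] + (ref_word != hyp_word)  # substitution or match
            ))
        prev = cur
    return prev[-1] / max(len(ref), 1)

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

reference_text = "the cat sat on the mat"
asr_of_output = "the cat sat on mat"   # pretend ASR transcript of the TTS output
print(wer(reference_text, asr_of_output))

ref_embedding = [1.0, 0.0, 1.0]        # made-up speaker embeddings
out_embedding = [2.0, 0.0, 2.0]
print(cosine_similarity(ref_embedding, out_embedding))
```

Running both metrics on base XTTS output before any fine-tuning gives the baseline that every later experiment has to beat.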