Learning audio AI in public

A new model every day, until ElevenLabs hires me.

I'm teaching myself text-to-speech by shipping one experiment at a time: building on the state-of-the-art foundations the giants have already laid, and stacking my own bricks on top. Honest logs, real audio you can play, and real numbers (even when they're humbling). The day the voice is good enough, ElevenLabs can call.

STACK · XTTS-v2 · neural codecs
HARDWARE · 1× RTX 3090
RULE · measure everything, ship daily

The log

The build, one brick at a time.

The roadmap

The brick ladder.

Each rung is one concept, one experiment, one number — stacked on the last. You can't improve what you can't measure, so it starts with measurement.

  • B0 · Eval harness — WER + speaker-similarity + naturalness. The foundation.
  • B1 · Zero-shot ceiling — how much voice you get for free from a good reference clip.
  • B2 · Single-voice data — curate one consistent speaker (✓ found: 4.3 h, clean).
  • B3 · Fine-tune that voice — and learn exactly when it beats zero-shot.
  • B4 · Inference knobs — temperature, reference length, multi-reference: free gains.
  • B5 · Data-scaling curve — 0.5 / 1 / 2 / 4 h: how much quality more data actually buys.
  • B6 · Architecture swap — XTTS vs. StyleTTS2 vs. F5 vs. VoxCPM2 on the same voice.
  • B7 · The codec deep-dive — semantic vs. acoustic tokens, the quality ceiling.
  • B8 · My own voice — record it, clone it, control its prosody.
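
To make the B0 rung concrete, here is a minimal sketch of two of its three numbers, using only the standard library. It assumes the transcripts come from some ASR model run on the synthesized audio and the vectors come from some speaker encoder; neither tool is named on this page, so both are placeholders. The function names `word_error_rate` and `cosine_similarity` are hypothetical, and naturalness is left out because it needs listeners or a learned predictor.

```python
from math import sqrt


def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must not be empty")
    # prev[j] = edit distance between the first i-1 reference words
    # and the first j hypothesis words (row-by-row Levenshtein).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two speaker embeddings (assumed given)."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    if norm == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / norm


if __name__ == "__main__":
    # Transcript of the synthesized audio vs. the text it was asked to say.
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
    # Toy 2-D vectors standing in for real speaker embeddings.
    print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))
```

Lower WER means the voice is more intelligible to the recognizer; speaker similarity closer to 1 means the clone sits nearer the reference voice in the encoder's embedding space. In practice the text would usually be normalized (punctuation, numbers) before scoring, which this sketch skips.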