Fine-tuning Chatterbox on a Low-Resource Language: 7 Things That Mattered

Resemble AI's Chatterbox Multilingual TTS is one of the few SOTA open-source TTS models with a real MIT license — code and weights — so it's a natural starting point if you want to build a commercially usable text-to-speech for a language that the official model doesn't ship.

I fine-tuned it on Slovak. The 23 supported languages in the official multilingual checkpoint don't include it, and there was nothing comparable in the open-source ecosystem — every other halfway decent multilingual TTS I could find (XTTS-v2, F5-TTS, Fish Speech) ships under non-commercial licenses. So I trained my own and published the weights.

The TTS sits at the end of a larger pipeline I'm building — an end-to-end Slovak video-dubbing tool (Whisper for transcription, a fine-tuned Gemma 3 for translation, MuseTalk for lip-sync) — and Slovak TTS was the missing piece. That's the reason "good enough on a single sentence" wasn't good enough; I needed audio that holds up across hours of generated speech.

The fine-tuning itself isn't the hard part. The hard part is the dozen tiny things that turn a "weights file that loads" into "weights file that produces audio you'd actually ship." Here are the seven that mattered most for me.

This post assumes you already know what fine-tuning a TTS is and have a base Chatterbox setup running. It's the practical-tuning notes I wish someone had written before I started.

1. The base model and your fine-tune disagree on vocab size

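Chatterbox uses a sentencepiece text tokenizer, and depending on training data your fine-tune may end up with a different vocab size than the base multilingual checkpoint. If you naively call load_state_dict(strict=True) to load the fine-tuned T3 weights into the base model, PyTorch raises a size-mismatch error. Passing strict=False does not help: it only tolerates missing or unexpected keys, not tensors whose shapes differ.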

The fix is to pad or trim the affected matrices (text_emb.weight and text_head.weight) before loading:

import torch
from safetensors.torch import load_file

# model is the base ChatterboxMultilingualTTS, already loaded
state = load_file(my_finetune_path, device="cpu")

target_vocab = model.t3.text_emb.weight.shape[0]
src_vocab = state["text_emb.weight"].shape[0]

if src_vocab > target_vocab:
    # Fine-tune has more rows than the base: keep only the first target_vocab rows
    state["text_emb.weight"] = state["text_emb.weight"][:target_vocab, :]
    state["text_head.weight"] = state["text_head.weight"][:target_vocab, :]
elif src_vocab < target_vocab:
    # Fine-tune has fewer rows: append copies of the mean row to fill the gap
    pad = target_vocab - src_vocab
    emb_pad = state["text_emb.weight"].mean(dim=0, keepdim=True).repeat(pad, 1)
    head_pad = state["text_head.weight"].mean(dim=0, keepdim=True).repeat(pad, 1)
    state["text_emb.weight"] = torch.cat([state["text_emb.weight"], emb_pad], dim=0)
    state["text_head.weight"] = torch.cat([state["text_head.weight"], head_pad], dim=0)

model.t3.load_state_dict(state, strict=True)

Padding with the mean of the existing rows gives tokens the fine-tune never saw a sensible "neutral" embedding. That is better than zero rows, which produce noise.
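Before patching anything, it helps to see exactly which tensors disagree. The helper below is a minimal sketch, not part of Chatterbox: `shape_mismatches` is a name I made up, and the shapes in the example are invented so the block runs without torch.

```python
def shape_mismatches(src_shapes, dst_shapes):
    """Return {name: (src_shape, dst_shape)} for every tensor whose shape differs.

    Both arguments map parameter names to shape tuples. A name present in only
    one of the two dicts is reported with None for the missing side.
    """
    report = {}
    for name in sorted(set(src_shapes) | set(dst_shapes)):
        src = src_shapes.get(name)
        dst = dst_shapes.get(name)
        if src != dst:
            report[name] = (src, dst)
    return report


# Invented shapes for illustration; real values come from your checkpoints.
finetune_shapes = {
    "text_emb.weight": (1200, 16),
    "text_head.weight": (1200, 16),
    "speech_emb.weight": (500, 16),
}
base_shapes = {
    "text_emb.weight": (1000, 16),
    "text_head.weight": (1000, 16),
    "speech_emb.weight": (500, 16),
}

mismatches = shape_mismatches(finetune_shapes, base_shapes)
for name, (src, dst) in mismatches.items():
    print(f"{name}: fine-tune {src} vs base {dst}")
```

With real checkpoints you would build the two dicts with something like `{k: tuple(v.shape) for k, v in state.items()}` and the same comprehension over `model.t3.state_dict()`. If anything other than the two text matrices shows up, the pad-or-trim approach above is probably not enough.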

2. The reference audio matters more than you think

Chatterbox is a zero-shot voice-cloning model. You give it a few seconds of someone speaking, and the output mimics that voice. Most articles stop there. They don't tell you that whatever noise is in your reference clip will be baked into every generation.

I was using a 5.7-second Common Voice Slovak clip as the reference. The model output was almost perfect — but had a faint low-frequency hum throughout. Same hum that was in the reference, which I hadn't noticed until I heard it stretched across thirty seconds of generated audio.

The fix is to clean the reference before you pass it to the model. Here's the ffmpeg chain I ended up with (based on what's in my production pipeline):

ffmpeg -i reference.wav -af "
  highpass=f=70,
  afftdn=nr=12:nt=w:om=o,
  lowpass=f=11000,
  equalizer=f=6800:t=q:w=1.2:g=-1.5,
  silenceremove=start_periods=1:start_silence=0.04:start_threshold=-50dB,
  areverse,
  silenceremove=start_periods=1:start_silence=0.06:start_threshold=-46dB,
  areverse
" reference_clean.wav

What each stage does:

highpass=f=70 attenuates the mains-hum fundamental (50 or 60 Hz) and low-frequency rumble
afftdn is FFT-based broadband denoising (nr=12 asks for 12 dB of noise reduction, which is gentle; pushing it higher starts to make speech metallic)
lowpass=f=11000 cuts hiss above 11 kHz, which Chatterbox doesn't reproduce anyway
The equalizer stage applies a gentle 1.5 dB cut around 6.8 kHz to tame sibilance
The silenceremove + areverse pairs trim silence from both ends: silenceremove as configured here only strips leading silence, so the clip is reversed, trimmed again (which removes the original tail), and reversed back

This matters because cloning is essentially a "voice colour transfer" — anything in the reference is part of the cloned voice. Garbage in, garbage out.
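To hear what each stage contributes, it is convenient to keep the chain as a list and build the command from it, so a stage can be dropped or changed for an A/B test. This is a sketch under my own naming (`REFERENCE_FILTERS` and `build_clean_command` are not from any library); the filter strings are the ones from the chain above.

```python
# Each entry is one ffmpeg audio filter, applied in this order.
REFERENCE_FILTERS = [
    "highpass=f=70",
    "afftdn=nr=12:nt=w:om=o",
    "lowpass=f=11000",
    "equalizer=f=6800:t=q:w=1.2:g=-1.5",
    "silenceremove=start_periods=1:start_silence=0.04:start_threshold=-50dB",
    "areverse",
    "silenceremove=start_periods=1:start_silence=0.06:start_threshold=-46dB",
    "areverse",
]


def build_clean_command(src, dst, filters=REFERENCE_FILTERS):
    """Return the ffmpeg argument list that cleans one reference clip."""
    return ["ffmpeg", "-y", "-i", src, "-af", ",".join(filters), dst]


full = build_clean_command("reference.wav", "reference_clean.wav")

# A/B variant: same chain without the denoiser, to check what afftdn changes.
no_denoise = build_clean_command(
    "reference.wav",
    "reference_no_denoise.wav",
    filters=[f for f in REFERENCE_FILTERS if not f.startswith("afftdn")],
)

print(" ".join(full))
```

Either list can be handed to `subprocess.run(cmd, check=True)`. Passing the arguments as a list avoids shell quoting problems with the colons and commas in the filter string.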

3. Generation parameters: defaults vs. tuned

The default model.generate() call works, but you can do better. After regression-testing on Slovak segments, I landed on these:

| Param | Default | Tuned for stability | Notes |
| --- | --- | --- | --- |
| `exaggeration` | 0.5 | 0.5 | Same |
| `cfg_weight` | 0.5 | 0.5 | Same |
| `temperature` | 0.8 | 0.6 | Lower → more stable, less variable |
| `top_p` | 1.0 | 0.92 | Cuts the long tail of low-prob samples |
| `repetition_penalty` | 1.0 | 1.25 | Prevents the model getting stuck on syllables |
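One way to keep both columns around is as named presets that expand into keyword arguments. This is a sketch: `PRESETS` and `generation_kwargs` are names I made up, the values are copied from the table, and whether your installed generate() accepts every key (top_p and repetition_penalty in particular) depends on the Chatterbox version, so check its signature first.

```python
# Values copied from the table above; "default" mirrors the Default column.
PRESETS = {
    "default": {
        "exaggeration": 0.5,
        "cfg_weight": 0.5,
        "temperature": 0.8,
        "top_p": 1.0,
        "repetition_penalty": 1.0,
    },
    "stable": {
        "exaggeration": 0.5,
        "cfg_weight": 0.5,
        "temperature": 0.6,
        "top_p": 0.92,
        "repetition_penalty": 1.25,
    },
}


def generation_kwargs(preset="stable", **overrides):
    """Return a fresh kwargs dict for the preset, with optional overrides."""
    if preset not in PRESETS:
        raise ValueError(f"unknown preset: {preset!r}")
    kwargs = dict(PRESETS[preset])  # copy so callers cannot mutate the preset
    kwargs.update(overrides)
    return kwargs


print(generation_kwargs("stable"))
print(generation_kwargs("default", temperature=0.7))
```

Usage would look like `model.generate(text=text, language_id="sk", **generation_kwargs("stable"))`.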

There's a real trade-off here: tuned parameters produce more consistent, less-likely-to-fail output, but they also flatten the prosody. The voice sounds slightly more monotone. For a production pipeline doing thousands of segments per day where any failure is worse than a slightly less expressive read, tuned wins. For a demo on a model card where you want one perfect take, default temperature with a deterministic seed and a retry loop wins.

RETRY_SEEDS = [42, 0, 123, 7, 99]
min_dur = max(0.3, len(text) / 25.0)

for seed in RETRY_SEEDS:
    torch.manual_seed(seed)
    wav = model.generate(text=text, language_id="sk")
    if wav.shape[-1] / model.sr >= min_dur:
        break

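The retry loop catches cases where the model produces output that is too short (it sometimes emits its end-of-sequence token early on hard inputs). The min_dur threshold assumes speech never runs faster than about 25 characters per second, with a 0.3 s floor for very short texts. If the first seed works, you stop; if not, you try another. Five seeds is plenty in practice.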
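The loop above leaves you with the last attempt if every seed fails. A variant that remembers the longest candidate and reports whether any seed passed can be sketched like this. `generate_with_retry` and `fake_generate` are hypothetical names, the generator is injected so the block runs without a model, and 24000 is only a placeholder sample rate (use model.sr with the real model).

```python
def generate_with_retry(generate_fn, text, sample_rate,
                        seeds=(42, 0, 123, 7, 99), chars_per_second=25.0):
    """Try each seed until the output is long enough; else keep the longest.

    generate_fn(text, seed) must return a 1-D sequence of audio samples.
    Returns (samples, seed, passed).
    """
    min_dur = max(0.3, len(text) / chars_per_second)
    best = None
    for seed in seeds:
        samples = generate_fn(text, seed)
        if len(samples) / sample_rate >= min_dur:
            return samples, seed, True
        if best is None or len(samples) > len(best[0]):
            best = (samples, seed)
    if best is None:
        raise ValueError("seeds must not be empty")
    return best[0], best[1], False


def fake_generate(text, seed):
    # Stand-in for the model: seed 42 stops early, other seeds run long enough.
    seconds = 0.2 if seed == 42 else 2.0
    return [0.0] * int(seconds * 24000)


samples, seed, passed = generate_with_retry(
    fake_generate, "Dobrý deň, vitajte.", sample_rate=24000
)
print(seed, passed, len(samples))
```

With the real model, generate_fn would call torch.manual_seed(seed), run model.generate(...), and return a 1-D view of the waveform (for example `wav.squeeze(0)`), since len() on a (1, N) tensor returns 1.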

4. The first word can be garbage. Add a warmup prefix.

Generating "Slovenský jazyk je úradný..." would sometimes produce "Zvolenský jazyk..." — the model's first token after the reference would morph. Sometimes the entire first word would be quiet noise.

This is a "warmup" artifact: the model has to transition from the reference voice's prosody to its own generation, and that first ~0.3 s is where it can wobble. Hard words at position zero hit hardest.

Two fixes, both work:

Reword. If "Slovenský" trips the model, start with "Slovenčina" instead. Easy, free, doesn't always work.

Add a warmup prefix. Put a short, easy phrase at the start that the model can land on cleanly:

Vitajte v ukážke.
Slovenčina je úradný jazyk Slovenskej republiky.
...

By the time the model reaches "Slovenčina," it has stabilised. The prefix becomes part of the generation but is short and natural, and you can always trim it in post if you don't want it.

This isn't unique to Chatterbox — most autoregressive TTS models have warmup behaviour at the very start. The fix transfers.
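If you do want to trim the prefix in post, one heuristic is to cut at the end of the first pause, which is where the sentence break after the prefix should fall. The sketch below is an assumption on my part, not something Chatterbox provides: `end_of_first_pause` is a made-up name, the thresholds are guesses that need tuning per voice, and it works on a plain sequence of floats in [-1, 1].

```python
def end_of_first_pause(samples, sample_rate, min_pause_s=0.15,
                       threshold=0.01, search_from_s=0.3):
    """Return the index where sound resumes after the first pause, or None.

    A pause is a run of at least min_pause_s seconds in which every sample
    has absolute amplitude below threshold. The search skips the first
    search_from_s seconds so leading silence is not mistaken for the pause.
    """
    min_run = round(min_pause_s * sample_rate)
    start = round(search_from_s * sample_rate)
    run = 0
    for i in range(start, len(samples)):
        if abs(samples[i]) < threshold:
            run += 1
        else:
            if run >= min_run:
                return i
            run = 0
    return None


# Synthetic clip at 1 kHz: 0.5 s "prefix", 0.2 s pause, 1.0 s "sentence".
sr = 1000
clip = [0.5] * 500 + [0.0] * 200 + [0.5] * 1000
cut = end_of_first_pause(clip, sr)
trimmed = clip[cut:] if cut is not None else clip
print(cut, len(trimmed))
```

A pure-Python loop is slow on long clips, so treat this as a way to check the idea before porting it to tensor operations. When no pause is found the function returns None, and keeping the prefix is the safer fallback.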

5. The model can't read out individual letter names

I tried generating "Má bohatú gramatiku, sedem pádov a špecifické hlásky ako ô, ľ alebo ŕ." — a sentence that names individual Slovak letters. The model produced nonsense for the letter-name part.

This is a known limitation of TTS models trained on running speech: they rarely see "the letter ô" as a phrase, so they don't know it should be pronounced as a name rather than as the sound itself. The same is true of acronyms ("NDA" gets read as "enda") and units ("20 %" sometimes becomes "two-hundred percent" because of how digit-percent pairs were tokenised in training data).

The fix is text normalisation before TTS. In my pipeline I have a small Slovak-specific preprocessor that rewrites:

20 % → dvadsať percent
Y100 → ypsilon sto
NDA → eN-Dý-Á
Letter names → spelled-out phonetic forms

This lives as a preprocessing layer, not part of the model. It's easier to fix text once than to retrain the model to handle every edge case.
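A toy version of such a preprocessor, covering only the examples listed above, might look like the following. Everything here is illustrative: the lookup tables hold just the entries from this post, and a real normaliser needs a full number-to-words routine and a complete letter-name table.

```python
import re

# Tiny illustrative lookups containing only the examples from this post.
NUMBER_WORDS = {"20": "dvadsať", "100": "sto"}
LETTER_NAMES = {"Y": "ypsilon"}
ACRONYMS = {"NDA": "eN-Dý-Á"}


def _number(digits):
    # Leave unknown numbers untouched rather than guess a spelling.
    return NUMBER_WORDS.get(digits, digits)


def _letter_number(match):
    letter, digits = match.group(1), match.group(2)
    if letter not in LETTER_NAMES:
        return match.group(0)
    return f"{LETTER_NAMES[letter]} {_number(digits)}"


def normalise(text):
    # "20 %" or "20%" becomes "dvadsať percent"
    text = re.sub(r"(\d+)\s*%", lambda m: f"{_number(m.group(1))} percent", text)
    # One capital letter followed by digits: "Y100" becomes "ypsilon sto"
    text = re.sub(r"\b([A-Z])(\d+)\b", _letter_number, text)
    # Known acronyms, matched as whole words only
    for acronym, spoken in ACRONYMS.items():
        text = re.sub(rf"\b{re.escape(acronym)}\b", spoken, text)
    return text


print(normalise("Zľava 20 % na model Y100 podľa NDA."))
```

The order matters: percentages are rewritten first so that their digits are gone before the letter-plus-digits rule runs.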

6. `max_new_tokens` matters for long generations

Chatterbox auto-scales the generation budget:

_max_toks = max_new_tokens or min(4096, max(1000, len(text) * 8))

For a 350-character text that gives 2800 tokens, which sounds like plenty (at ~25 Hz token rate that's over 100 seconds of audio). But the model can EOS early on a particular word — even with a budget far above what the text needs. I had a 30-second narrative whose final word ("Amerike") got cut short despite a 4096-token budget.
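The arithmetic is easy to check for other text lengths. This sketch just re-states the budget expression quoted above as a function (`token_budget` is my name for it) and converts tokens to seconds with the approximate 25 Hz rate.

```python
TOKEN_RATE_HZ = 25  # approximate speech-token rate quoted above


def token_budget(text, max_new_tokens=None):
    """Same expression as the auto-scaling line quoted above."""
    return max_new_tokens or min(4096, max(1000, len(text) * 8))


for n_chars in (50, 350, 600):
    tokens = token_budget("x" * n_chars)
    print(f"{n_chars} chars: {tokens} tokens, about {tokens / TOKEN_RATE_HZ:.1f} s")
```

Short texts hit the 1000-token floor, the 350-character case lands at 2800 tokens (112 s), and anything above 512 characters is capped at 4096 tokens.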

When you have a long input, set max_new_tokens=4096 explicitly and check for premature EOS in postprocessing. If your output ends mid-word, treat it as a generation failure and retry with a different seed (see #3).

For very long inputs, a more reliable strategy is to chunk into sentences, generate each separately, and concatenate with a short crossfade. Chatterbox doesn't have first-class chunking yet, so you do this at the application layer.
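A minimal application-layer version of that strategy could look like this. It is a sketch with hypothetical helper names (`split_sentences`, `crossfade`, `join_chunks`), it works on plain lists of samples so it runs without torch, and the naive splitter will break on abbreviations such as "napr.", which a real pipeline has to handle.

```python
import re


def split_sentences(text):
    """Split after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def crossfade(a, b, overlap):
    """Join two sample lists, blending the end of a into the start of b
    over `overlap` samples with a linear ramp."""
    overlap = min(overlap, len(a), len(b))
    if overlap == 0:
        return list(a) + list(b)
    head = list(a[:len(a) - overlap])
    blended = []
    for i in range(overlap):
        w = (i + 1) / (overlap + 1)  # weight of b rises towards 1
        blended.append(a[len(a) - overlap + i] * (1 - w) + b[i] * w)
    return head + blended + list(b[overlap:])


def join_chunks(chunks, overlap):
    """Crossfade a list of sample lists into one."""
    if not chunks:
        return []
    out = list(chunks[0])
    for chunk in chunks[1:]:
        out = crossfade(out, chunk, overlap)
    return out


sentences = split_sentences("Prvá veta. Druhá veta! Tretia?")
joined = join_chunks([[1.0] * 10, [0.0] * 10], overlap=4)
print(sentences, len(joined))
```

Each sentence would go through the retry loop from #3 on its own, which also means one bad sentence can be regenerated without redoing the whole text. An overlap of a few tens of milliseconds is a reasonable starting point, but that number is a guess to tune by ear.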

7. Use `prepare_conditionals` separately from `generate`

The Chatterbox API supports passing audio_prompt_path= directly to generate(), but in a production loop where you're generating many segments with the same voice, it's faster to call prepare_conditionals once and reuse:

model.prepare_conditionals(reference_path)

# Now generate many times without re-loading reference each time
for text in texts:
    wav = model.generate(text=text, language_id="sk")

The conditioning extracts speaker-identity features from the reference. Doing it once amortises the cost across the loop. For one-off demos it doesn't matter; for batch jobs it can shave noticeable time.

A related gotcha: if you switch reference voices mid-loop, remember to re-run prepare_conditionals (or reset model.conds to None) before the next generation, or you'll keep cloning the previous voice.
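One way to make that mistake harder is a thin wrapper that remembers which reference is loaded and re-prepares only when it changes. This is a sketch: `VoiceSession` is a hypothetical class, and `FakeModel` stands in for Chatterbox so the block runs on its own. It assumes only the two methods used in this post, prepare_conditionals(path) and generate(text=..., ...).

```python
class VoiceSession:
    """Re-run prepare_conditionals only when the reference clip changes."""

    def __init__(self, model):
        self.model = model
        self._current_reference = None

    def generate(self, text, reference_path, **kwargs):
        if reference_path != self._current_reference:
            self.model.prepare_conditionals(reference_path)
            self._current_reference = reference_path
        return self.model.generate(text=text, **kwargs)


class FakeModel:
    """Stand-in that records calls so the sketch runs without Chatterbox."""

    def __init__(self):
        self.prepared = []

    def prepare_conditionals(self, path):
        self.prepared.append(path)

    def generate(self, text, **kwargs):
        return f"{self.prepared[-1]}:{text}"


fake = FakeModel()
session = VoiceSession(fake)
jobs = [("a", "anna.wav"), ("b", "anna.wav"), ("c", "peter.wav")]
outputs = [session.generate(t, ref, language_id="sk") for t, ref in jobs]
print(fake.prepared, outputs)
```

The cache is keyed on the path string, so overwriting a file in place under the same name will not trigger a refresh.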

Final recipe

Putting it together — the inference snippet I'd hand to someone starting fresh:

import torch, torchaudio, subprocess
from chatterbox.mtl_tts import ChatterboxMultilingualTTS
from safetensors.torch import load_file

device = "cuda" if torch.cuda.is_available() else "cpu"

# 1. Clean the reference
subprocess.run([
    "ffmpeg", "-y", "-i", "reference.wav",
    "-af",
    "highpass=f=70,afftdn=nr=12,lowpass=f=11000,"
    "equalizer=f=6800:t=q:w=1.2:g=-1.5,"
    "silenceremove=start_periods=1:start_silence=0.04:start_threshold=-50dB,"
    "areverse,silenceremove=start_periods=1:start_silence=0.06:start_threshold=-46dB,"
    "areverse",
    "reference_clean.wav",
], check=True)

# 2. Load base + patch in fine-tune
model = ChatterboxMultilingualTTS.from_pretrained(device=device)
state = load_file("my_finetune_t3.safetensors", device="cpu")
# vocab resize block from #1 here
model.t3.load_state_dict(state, strict=True)
model.t3.to(device).eval()

# 3. Prepare reference once
model.prepare_conditionals("reference_clean.wav")

# 4. Generate with retry
text = "Vitajte v ukážke. " + your_actual_text  # warmup prefix from #4
RETRY_SEEDS = [42, 0, 123, 7, 99]
min_dur = max(0.3, len(text) / 25.0)

wav = None
for seed in RETRY_SEEDS:
    torch.manual_seed(seed)
    with torch.inference_mode():
        candidate = model.generate(text=text, language_id="sk", max_new_tokens=4096)
    if candidate.shape[-1] / model.sr >= min_dur:
        wav = candidate
        break

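# Fail loudly instead of crashing on wav.cpu() when every seed came out too short
if wav is None:
    raise RuntimeError("every seed produced audio shorter than min_dur")

torchaudio.save("output.wav", wav.cpu(), model.sr)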

This is roughly what's running in my pipeline today.

What didn't work

A few things I tried that turned out to be dead ends, in case you're tempted by them:

Aggressive denoising of the reference (afftdn with nr=18 or higher). Removes hum reliably but starts producing a metallic, "phasey" cloned voice. nr=12, i.e. 12 dB of reduction, is the sweet spot.
temperature=0.4 or below. Flattens prosody to the point of sounding like a robocall.
Skipping the warmup prefix and trimming the first 0.3 s in post. Works most of the time, but occasionally cuts the start of an actually-clean first word. The prefix is more reliable.
Trying to fix the EOS-early problem with repetition_penalty=1.5 or higher. Did not help; the model's stop decision is upstream of repetition logic.

Wrapping up

If you're fine-tuning Chatterbox on a language it doesn't ship, the biggest things you can do outside the training loop are:

Clean your reference audio
Tune your generation params (or accept the defaults' trade-offs)
Normalise your text before it reaches the model
Use a warmup prefix for the first word
Set max_new_tokens explicitly and have a retry path

The Slovak fine-tune lives at huggingface.co/pekiskol/chatterbox-tts-slovak under MIT — drop in the inference snippet above and it should work out of the box. Feedback (or different language fine-tunes that hit the same gotchas) welcome on the model's HF discussions tab.

If you've discovered other Chatterbox tuning tricks I missed, leave a comment — I'd like to extend this list.

The code in this post is also in the release scripts on the model page.
