Voicing 47 languages: lessons from generating hundreds of TTS files

Published by The Language Level Check team

We just finished generating the v2 audio across all 47 supported languages. We had a few false starts on the way, and we thought it’d be useful to write up what we settled on and what we learned.

Why this is hard

Listening items are the audio side of the test. Each one is a short utterance (rarely longer than 10 seconds) paired with a comprehension question. The audio has to:

  • Sound like a native speaker, in a natural register, at the right CEFR-level reading speed.
  • Pronounce vocabulary correctly. Especially the words the comprehension question hinges on.
  • Land cleanly. No weird breath at the end, no truncated phonemes, no clicks.
  • Be reproducible. When we add or fix an item, we need to regenerate that specific clip without re-rendering everything else.

The blocking constraint: no single TTS provider produces uniformly good output across 47 languages. Some are excellent at English and German, weak at Thai or Swahili. Some have a native-quality voice for Mandarin but only a robotic one for Cantonese. Multilingual coverage at this scale is, today, a fragmented landscape.

What we settled on

After evaluating several providers, we landed on one as our primary, with a fallback path for the handful of languages it doesn’t yet support well. Its strengths matched our needs: natural prosody, consistent quality across most of our supported languages, reasonable cost at our scale.

Languages the primary provider doesn’t yet handle well are now the bottleneck for expanding coverage. A handful of locales currently render through the fallback path, which produces audio we consider acceptable but not great. We’ve deliberately deferred adding further languages that would depend on that path until their audio quality matches what we ship elsewhere, rather than ship listening items we’re not proud of.
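As a rough illustration of the primary-plus-fallback split, a routing table can be as small as a set of locale codes. The locale codes and provider labels below are placeholders, not our real configuration, and the function names are hypothetical.

```python
# Placeholder locale codes: which locales sit on the fallback path is an
# assumption made up for this sketch, not the real list.
FALLBACK_LOCALES = frozenset({"xx-XX", "yy-YY"})


def pick_provider(locale: str, fallback_locales: frozenset = FALLBACK_LOCALES) -> str:
    """Return "fallback" for locales on the fallback list, else "primary"."""
    return "fallback" if locale in fallback_locales else "primary"


def meets_quality_bar(locale: str) -> bool:
    """A locale meets the bar in this sketch only if the primary provider renders it."""
    return pick_provider(locale) == "primary"
```

Keeping the fallback list explicit makes it easy to see, in one place, which locales are waiting on better provider support.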

The breath problem

Here’s something nobody warned us about. Most TTS providers, given a short utterance, generate a small inhalation at the start or a release of breath at the end. Sometimes both. On a 4-second clip, a 0.3-second breath is 7.5% of the total runtime, and it sounds wrong. It pulls the listener out of the moment and makes the clip feel synthetic.

We wrote a post-processing pass that:

  1. Trims silence from the start and end of every clip to a small fixed pad.
  2. Cross-fades the very beginning and end to suppress click artifacts.
  3. Normalizes peak loudness to a target, so that no language’s clips are dramatically louder or quieter than another’s.
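The three steps above can be sketched in plain Python on a list of float samples in the range -1.0 to 1.0. This is a minimal sketch, not our production pass: the threshold, pad, fade length, and target peak are assumed values, and the fade here is a simple linear fade-in and fade-out. Note that an amplitude threshold only removes a breath if the breath is quieter than the threshold.

```python
from typing import List


def trim_silence(samples: List[float], sample_rate: int,
                 threshold: float = 0.01, pad_s: float = 0.05) -> List[float]:
    """Keep the span between the first and last loud sample, plus a fixed pad."""
    loud = [i for i, s in enumerate(samples) if abs(s) > threshold]
    if not loud:
        return []  # nothing above the threshold: the clip is all silence
    pad = round(pad_s * sample_rate)
    start = max(0, loud[0] - pad)
    end = min(len(samples), loud[-1] + 1 + pad)
    return samples[start:end]


def apply_fades(samples: List[float], sample_rate: int,
                fade_s: float = 0.01) -> List[float]:
    """Linear fade-in and fade-out to suppress clicks at the clip edges."""
    n = min(round(fade_s * sample_rate), len(samples) // 2)
    out = list(samples)
    for i in range(n):
        gain = i / n  # 0.0 at the very edge, rising toward 1.0
        out[i] *= gain
        out[-1 - i] *= gain
    return out


def normalize_peak(samples: List[float], target_peak: float = 0.9) -> List[float]:
    """Scale the clip so its largest absolute sample equals target_peak."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        return list(samples)
    scale = target_peak / peak
    return [s * scale for s in samples]


def postprocess(samples: List[float], sample_rate: int) -> List[float]:
    """Trim, fade, then normalize, in that order."""
    trimmed = trim_silence(samples, sample_rate)
    return normalize_peak(apply_fades(trimmed, sample_rate))
```

Normalizing last means the trim and fade steps cannot change the final peak level.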

Our listening items now sound like clips rather than recordings, which is honestly closer to user expectations than “real” recordings would have been. Real recordings would carry ambient room tone and microphone characteristics that vary per language. Clean TTS, normalized, is more consistent.

Voice selection

We picked one voice per language. We considered using multiple speakers per language for variety, and we might yet, but it added a calibration problem: voices have different pacing, and a CEFR-B2 utterance at one speaker’s natural rate isn’t the same listening difficulty as the same words at another’s. Holding the voice constant per language lets the speed and pacing be deliberate per CEFR level.
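One way to picture this is a lookup with one fixed voice per locale and one speaking-rate multiplier per CEFR level. Every voice ID and rate value below is a made-up placeholder for illustration, not a setting we actually use, and `render_settings` is a hypothetical helper.

```python
# Placeholder values only: real voice IDs and rates would come from calibration.
VOICE_BY_LOCALE = {
    "es-ES": "voice-es-es-1",
    "de-DE": "voice-de-de-1",
}

RATE_BY_CEFR = {
    "A1": 0.80, "A2": 0.85,
    "B1": 0.90, "B2": 0.95,
    "C1": 1.00, "C2": 1.00,
}


def render_settings(locale: str, cefr_level: str) -> dict:
    """Combine the locale's single voice with the rate for the CEFR level.

    Raises KeyError for an unknown locale or level, so gaps fail loudly.
    """
    return {
        "locale": locale,
        "voice": VOICE_BY_LOCALE[locale],
        "rate": RATE_BY_CEFR[cefr_level],
    }
```

Because the voice is fixed per locale, the rate is the only variable that changes with level, which keeps difficulty comparable across items in the same language.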

Where the provider offered a choice, we picked voices that were:

  • Neutral in register. Not too formal, not too casual.
  • Mid-range pitch. Very low and very high voices both compress worse and are harder to understand at low playback volumes.
  • Recognizably native to the target locale. Not a generic “world Spanish” voice for es.

What we’d do differently

Two things we wish we’d decided earlier:

  • Standardize the prompt format we send to TTS. We were inconsistent, passing some clips as plain text and others with markup, which produced subtle quality differences across languages. We now pass every clip through a single normalization step before TTS.
  • Capture audio metadata at generation time. Voice name, model version, render settings, all in a sidecar JSON file. When we regenerate a clip six months later, we want to know exactly what we asked for the first time.
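Both points can be sketched together: normalize the text once, then record what was sent next to the audio file. This is a sketch under assumptions. It assumes the normalization target is plain text (tags stripped, whitespace collapsed, Unicode in NFC form), and the field names in the sidecar are hypothetical rather than our actual schema.

```python
import hashlib
import json
import re
import unicodedata
from pathlib import Path

_TAG = re.compile(r"<[^>]+>")
_WHITESPACE = re.compile(r"\s+")


def normalize_prompt(text: str) -> str:
    """Return plain text: NFC Unicode, markup tags removed, whitespace collapsed."""
    text = unicodedata.normalize("NFC", text)
    text = _TAG.sub(" ", text)  # replace each tag with a space so words stay apart
    return _WHITESPACE.sub(" ", text).strip()


def build_sidecar(item_id: str, text: str, voice: str,
                  model_version: str, settings: dict) -> dict:
    """Collect everything needed to reproduce the render later."""
    normalized = normalize_prompt(text)
    return {
        "item_id": item_id,
        "text": normalized,
        # The hash makes it cheap to detect that an item's text has changed.
        "text_sha256": hashlib.sha256(normalized.encode("utf-8")).hexdigest(),
        "voice": voice,
        "model_version": model_version,
        "render_settings": settings,
    }


def write_sidecar(audio_path: Path, metadata: dict) -> Path:
    """Write the metadata next to the audio file, e.g. clip.wav -> clip.json."""
    sidecar = audio_path.with_suffix(".json")
    sidecar.write_text(
        json.dumps(metadata, ensure_ascii=False, indent=2, sort_keys=True),
        encoding="utf-8",
    )
    return sidecar
```

With the sidecar in place, regenerating a single clip means reading its JSON and replaying the same voice, model version, and settings.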

The clips are in production now. If you ever notice one that sounds wrong, the Report Issue button on the question screen sends us the item ID. That’s the fastest way for us to regenerate just that clip.