Production-ready: 37 languages, real audio, and a curated study path

This week we cut what we’re calling the first production-ready build. The earlier release was production in the sense that we shipped it. This one is production in the sense that it does the thing it claims to do, end to end, in every language we ship.

Three things came together for this release: regenerated audio across every listening item, a full pass of hand-vetted learning resources for every language and level, and a redesign of the post-test recommendations so users actually do something with their score.

Audio: every listening item, regenerated

The earlier listening items came out of a TTS pipeline we now consider provisional. Quality varied. Some languages were excellent. Some were stilted. A handful had pronunciation errors that crept past our spot checks. For this release we regenerated everything through a single canonical pipeline and did a per-language listening pass on the output.

We also wrote a transcript fallback for every listening item. Two reasons. Accessibility (some users prefer to read). Reliability (audio is the part of the test most likely to misbehave on a flaky network).
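A minimal sketch of how that fallback could behave, under stated assumptions: the `ListeningItem` shape, the `present_item` function, and the `load_audio` callback are all hypothetical names for illustration, not the shipped code. It covers both reasons above: a reader preference goes straight to the transcript, and a failed audio load falls back to it.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ListeningItem:
    # Hypothetical shape: every item carries both an audio URL and a transcript.
    audio_url: str
    transcript: str


def present_item(
    item: ListeningItem,
    prefers_text: bool,
    load_audio: Callable[[str], Optional[bytes]],
) -> tuple[str, object]:
    """Return ("audio", data) when audio is usable, else ("transcript", text)."""
    if prefers_text:
        # Accessibility: the user asked to read instead of listen.
        return ("transcript", item.transcript)
    try:
        data = load_audio(item.audio_url)
    except OSError:
        # Reliability: treat network or I/O failures as "no audio".
        data = None
    if not data:
        return ("transcript", item.transcript)
    return ("audio", data)
```

The loader is passed in so the same logic can be exercised with a deliberately failing loader, which is the flaky-network case.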

663 learning resources, hand-vetted

This is the work we’re most proud of in this release. Every CEFR level in every supported language now has a curated set of learning resources. Textbooks, podcasts, structured courses, reference sites. Each one hand-checked against the level it claims to serve. The 663 number is the total across all languages and levels.

We use these on the results screen to give you a concrete next step. “You scored at B1 Spanish, here are five things people at B1 Spanish tend to find useful.” Small piece of the test experience, but in our internal sessions it changed the result from feeling like a verdict to feeling like a starting point.

A note on book links: if you click through to a recommended book and buy it, we may earn a small affiliate commission. We chose books we’d recommend anyway, and the commission applies whether you buy the book we linked to or anything else in that browsing session. But you should know it’s there.

Redesigned recommendations

The original recommendations section was small, secondary, and easy to scroll past. After watching how few users engaged with it, we redesigned it. Bigger cards. Clearer per-resource descriptions. A visual hierarchy that puts recommended resources roughly on par with the score itself.

Engagement on resources rose substantially after the redesign. The kind of change that’s invisible in the code review but visible in user behavior.

What didn’t make this release

Two things we expected to ship are not in this release:

Custom blueprint difficulty curves per language. Still using the default A1=1, A2=2, …, C1=5 weighting for every blueprint. We have data suggesting we should bias the curve harder in some languages, but we want more sessions before we touch this.
Mid-test difficulty adaptation. We considered an adaptive variant where questions get easier or harder based on how you’re doing, and decided against it for now. Doing adaptive testing well takes significant calibration work, and a fixed-form test is much easier to interpret. We may revisit it once we have an item response theory model in place.
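For the first item, the default weighting and a per-language override could be sketched as below. This is an illustration only: it assumes the elided middle of the curve steps by one per level, and `weight_for` and the `overrides` mapping are hypothetical names, not part of the current build.

```python
# CEFR levels covered by the weighting quoted above, in order.
LEVELS = ["A1", "A2", "B1", "B2", "C1"]

# Default curve: A1=1, A2=2, ..., C1=5 (assumes the elided middle steps by 1).
DEFAULT_WEIGHTS = {level: i for i, level in enumerate(LEVELS, start=1)}


def weight_for(language, level, overrides=None):
    """Return the weight for a level, preferring a per-language override."""
    per_language = (overrides or {}).get(language, {})
    return per_language.get(level, DEFAULT_WEIGHTS[level])
```

With no overrides, every language gets the default curve, which matches today's behavior. A custom curve would only need an entry for the levels it changes.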

What we learned this week

The audio regeneration touched 144 files across 37 languages. We learned (again) that you can’t trust a single spot-check per language. Three different listeners per language is the minimum for catching the wrong-stress and wrong-accent errors TTS produces. We’ve written that into our content QA checklist.
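A checklist rule like this is easy to enforce mechanically. The sketch below assumes review records are available as `(language, listener)` pairs. That record shape and the `under_reviewed` function are assumptions for illustration, not a description of the actual QA tooling.

```python
from collections import defaultdict

MIN_LISTENERS = 3  # the minimum named in the text above


def under_reviewed(reviews, languages, minimum=MIN_LISTENERS):
    """Return languages with fewer than `minimum` distinct listeners, sorted."""
    listeners = defaultdict(set)
    for language, listener in reviews:
        # A set means repeat passes by the same listener count only once.
        listeners[language].add(listener)
    return sorted(lang for lang in languages if len(listeners[lang]) < minimum)
```

Passing the full language list separately means a language with no reviews at all is still flagged, instead of silently dropping out of the report.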

Next release we want to widen language coverage to 46. After that, a hard look at cut scores using the data we’ve now started accumulating.