v2: word reorders, minimal pairs, listening — and 46 languages

Today we’re shipping v2 of Language Level Check. It’s the biggest release we’ve cut since launch, and the one we think actually makes the test feel proficient at testing proficiency.

Three things shipped together. An expanded blueprint format with three new question types. Support for 46 languages, up from 37. A refreshed UI built around a deeper navy that holds up better in dark mode and in the App Store screenshots.

New question types

The original test had two production formats: multiple choice (for cloze and reading) and constrained typed responses. v2 adds three more.

Word reorder. A jumbled set of word tiles the user arranges into a correct sentence. Useful for testing syntax in ways MC questions can’t.
Minimal pair. Two sentences that differ by one word. The user picks which is grammatical or which a native speaker would actually say. Excellent for collocation and morphology.
Listening with transcript fallback. The original listening items only had audio. v2 attaches a transcript to every listening item that can be revealed if the audio fails or the user wants to read instead.
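
To make the word reorder type concrete, here is a minimal sketch of how such an item could be graded. The class name, field names, and the idea of storing every accepted ordering explicitly are assumptions for illustration, not the shipped implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WordReorderItem:
    """Hypothetical shape for a word-reorder item; field names are illustrative."""
    tiles: tuple[str, ...]                        # tiles as shown to the user (jumbled)
    accepted_orders: tuple[tuple[str, ...], ...]  # every ordering counted as correct


def is_correct_order(item: WordReorderItem, response: list[str]) -> bool:
    """Return True if the response uses exactly the given tiles in an accepted order."""
    # Reject responses that drop, repeat, or invent tiles.
    if sorted(response) != sorted(item.tiles):
        return False
    return tuple(response) in item.accepted_orders


# Example item: more than one ordering can be a correct sentence.
item = WordReorderItem(
    tiles=("yesterday", "I", "went", "home"),
    accepted_orders=(
        ("I", "went", "home", "yesterday"),
        ("yesterday", "I", "went", "home"),
    ),
)

print(is_correct_order(item, ["I", "went", "home", "yesterday"]))
print(is_correct_order(item, ["went", "I", "home", "yesterday"]))
```

Listing accepted orderings explicitly keeps grading a plain lookup, at the cost of authoring every valid ordering per item.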

The new types share scoring code with the existing types. Same weighted CTT (classical test theory) scoring engine, same per-skill subscore math, same cut-score lookup. The blueprint schema got an additional production format field but is otherwise compatible with v1 readers (we kept the v1 path alive while we migrated).
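
As a rough illustration of why new question types can reuse the scoring path, here is a minimal sketch of weighted scoring, per-skill subscores, and a cut-score lookup. Every name and every number below is an assumption made up for the example. In particular, the cut scores are placeholders, not our real thresholds.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ScoredItem:
    skill: str     # e.g. "reading", "listening" (illustrative labels)
    weight: float  # item weight taken from the blueprint
    correct: bool  # outcome from whichever question-type grader ran


def weighted_percent(items: list[ScoredItem]) -> float:
    """Weighted percent correct: earned weight over total weight, times 100."""
    total = sum(i.weight for i in items)
    if total == 0:
        return 0.0
    earned = sum(i.weight for i in items if i.correct)
    return 100.0 * earned / total


def subscores(items: list[ScoredItem]) -> dict[str, float]:
    """Apply the same weighted math to each skill separately."""
    by_skill: dict[str, list[ScoredItem]] = defaultdict(list)
    for i in items:
        by_skill[i.skill].append(i)
    return {skill: weighted_percent(group) for skill, group in by_skill.items()}


def lookup_level(score: float, cut_scores: list[tuple[float, str]],
                 floor: str = "below A1") -> str:
    """Return the highest level whose minimum score is met."""
    level = floor
    for minimum, name in sorted(cut_scores):
        if score >= minimum:
            level = name
    return level


# Placeholder cut scores, invented for this sketch.
CUTS = [(20, "A1"), (35, "A2"), (50, "B1"), (65, "B2"), (80, "C1"), (92, "C2")]

session = [
    ScoredItem("reading", 1.0, True),
    ScoredItem("reading", 2.0, False),
    ScoredItem("listening", 2.0, True),
    ScoredItem("grammar", 1.0, True),
]

overall = weighted_percent(session)
print(overall, lookup_level(overall, CUTS), subscores(session))
```

The scorer only sees a skill, a weight, and a correct/incorrect outcome, so a new question type needs its own grader but no new scoring math.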

46 languages

The previous build supported 37. v2 adds nine: Cantonese, Bengali, Persian, Swahili, Urdu, Norwegian Bokmål, Catalan, Slovak, and Serbian.

Cantonese is the one we’re most nervous about. Cantonese and Mandarin share a writing system but have different grammar and very different spoken forms, and our scoring framework is CEFR with a custom mapping rather than HSK (which is Mandarin-specific). We expect Cantonese cut scores to drift more than those of our other languages while we accumulate data.

A real blueprint schema

v1 blueprints were essentially “an array of items with some metadata.” v2 formalizes them:

TestBlueprint
├── metadata (language, framework, version, item count, question types)
├── readingPassages (keyed by ID, so passages can be shared across items)
├── scoringModel (CEFR weights, cut scores, subscore configuration)
└── items (array of TestItem)
    └── productionSpec (per-item production format + accept list)

The reading-passage indirection is the one we’re happiest about. v1 stored a full copy of the passage on every comprehension item. v2 lets multiple items reference one passage by ID, so a single 200-word passage can support three or four comprehension items without duplication.
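
A minimal sketch of that shape, assuming Python dataclasses and snake_case field names (the real schema’s serialization and exact field names are not shown here; the format strings and helper methods are hypothetical):

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ProductionSpec:
    format: str                                   # e.g. "multiple_choice" (illustrative value)
    accept_list: list[str] = field(default_factory=list)


@dataclass
class TestItem:
    item_id: str
    prompt: str
    production_spec: ProductionSpec
    passage_id: Optional[str] = None              # None for items with no passage


@dataclass
class TestBlueprint:
    metadata: dict
    reading_passages: dict[str, str]              # passage ID -> passage text
    scoring_model: dict
    items: list[TestItem]

    def passage_for(self, item: TestItem) -> Optional[str]:
        """Resolve an item's passage through the shared table."""
        if item.passage_id is None:
            return None
        return self.reading_passages[item.passage_id]

    def dangling_passage_refs(self) -> list[str]:
        """IDs of items that point at a passage the blueprint does not contain."""
        return [
            i.item_id
            for i in self.items
            if i.passage_id is not None and i.passage_id not in self.reading_passages
        ]


blueprint = TestBlueprint(
    metadata={"language": "ca", "framework": "CEFR", "version": 2},
    reading_passages={"p1": "Text of a roughly 200-word passage goes here."},
    scoring_model={},
    items=[
        TestItem("q1", "What is the main idea?", ProductionSpec("multiple_choice"), passage_id="p1"),
        TestItem("q2", "Why does the author object?", ProductionSpec("multiple_choice"), passage_id="p1"),
        TestItem("q3", "Pick the grammatical sentence.", ProductionSpec("minimal_pair")),
    ],
)

# Both comprehension items resolve to the same stored passage.
print(blueprint.passage_for(blueprint.items[0]) is blueprint.passage_for(blueprint.items[1]))
print(blueprint.dangling_passage_refs())
```

One cost of the indirection is that a reference can now point at nothing, which is why the sketch includes a check for dangling passage IDs.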

The productionSpec.acceptList is the part that took the longest. The accept list is the set of strings the engine will count as correct, including reasonable alternatives, alternate spellings, and dialect variants. Generating a good accept list by hand is hours of work per item. We’re not happy with that cost, but we haven’t found a better answer yet.
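
For a sense of what matching against an accept list can look like, here is a minimal sketch. The normalization steps (Unicode NFKC, case folding, whitespace collapsing) are assumptions chosen for the example, not a description of what the engine actually does, and the function names are hypothetical.

```python
import unicodedata


def normalize(text: str) -> str:
    """Fold away differences that should not count against the user."""
    text = unicodedata.normalize("NFKC", text)  # unify composed/decomposed characters
    text = text.casefold()                      # case-insensitive comparison
    return " ".join(text.split())               # trim and collapse whitespace


def is_accepted(response: str, accept_list: list[str]) -> bool:
    """True if the normalized response matches any normalized accept-list entry."""
    accepted = {normalize(entry) for entry in accept_list}
    return normalize(response) in accepted


# Alternate spellings still have to be listed explicitly.
spelling_variants = ["colour", "color"]
print(is_accepted("  Colour ", spelling_variants))
print(is_accepted("colr", spelling_variants))

# A decomposed accent (e + combining acute) matches the composed form.
print(is_accepted("Cafe\u0301", ["café"]))
```

Normalization like this only removes mechanical variation. Deciding which alternatives and dialect variants belong on the list is still the manual part.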

UI refresh

We pulled the test UI through a redesign over the same release window. The new palette (deeper navy as primary, softer cream as background) reads better in screenshots, holds up at small sizes on the language picker, and tested better in dark mode. Aesthetically it’s closer to a textbook than a quiz, which we think is the right cue.

The bigger change is in the question runner. The progress indicator is now anchored to the bottom of the screen rather than the top. “I don’t know” is now an explicit button rather than buried in the option list. The question-type badge is larger and color-coded. Small things individually, but together they noticeably reduced confusion in our usability sessions.

What’s next

Most of our focus for the next release is calibration. We have enough sessions now to start checking whether our cut scores match real-world performance. We also want to add a language request feature so users can tell us which language they want next.

Both are in flight. v2 is what shipped today.