Inside the scoring engine: how 40 questions land your CEFR within half a level

Now that real test sessions are rolling in, we want to write up how the scoring actually works. We get questions about this fairly often, usually some variant of “I got a B1 last week and a B2 this week, what gives?” This post is the long answer.

The unit of scoring is an item, not a question

Each test has 40 items. An item is one question (multiple choice, reading comprehension, listening, or a typed/reorder production task) plus metadata that describes:

Target CEFR level for that item: A1 through C1.
Question type, used both for routing to the right UI and for the per-skill subscore breakdown.
Construct: what the item is testing (subject-verb agreement, inferential reading, weak vowel reduction, etc.). We use this for diagnostic notes more than for scoring.

A blueprint specifies how those 40 items are distributed across levels: roughly even across A1–C1 for most languages, slightly weighted toward the middle, since that’s where most users land and where discrimination matters most.
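
To make the item-plus-blueprint idea concrete, here is a minimal Python sketch. The field names, the `BLUEPRINT` counts, and the `matches_blueprint` helper are all illustrative assumptions, not the production schema; real blueprints vary by language.

```python
from collections import Counter
from dataclasses import dataclass

CEFR_LEVELS = ("A1", "A2", "B1", "B2", "C1")


@dataclass(frozen=True)
class Item:
    item_id: str        # hypothetical identifier
    level: str          # target CEFR level, one of CEFR_LEVELS
    question_type: str  # drives UI routing and the per-skill subscore
    construct: str      # what the item tests; used for diagnostic notes


# Hypothetical blueprint: 40 items, slightly weighted toward the middle.
BLUEPRINT = {"A1": 7, "A2": 8, "B1": 9, "B2": 9, "C1": 7}


def matches_blueprint(items, blueprint=BLUEPRINT):
    """True if the items contain exactly the per-level counts in the blueprint."""
    counts = Counter(item.level for item in items)
    return (
        len(items) == sum(blueprint.values())
        and all(counts[level] == n for level, n in blueprint.items())
    )
```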

Weighted, not raw

Naive scoring gives one point per correct answer. The problem: a user who gets six A1 questions right and four C1 wrong scores the same 6 out of 10 as one who gets six C1 right and four A1 wrong. Those two aren’t at the same level.

Our model assigns each CEFR level a weight. The defaults that most blueprints inherit:

A1 = 1, A2 = 2, B1 = 3, B2 = 4, C1 = 5

Your weighted score is the sum of weights for items you got right. The maximum is the sum of weights for every item in the test.
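
Here is that arithmetic as a short Python sketch, using the default weights above and the two users from the earlier example. The `weighted_score` function and the `(level, is_correct)` response shape are assumptions made for illustration.

```python
WEIGHTS = {"A1": 1, "A2": 2, "B1": 3, "B2": 4, "C1": 5}


def weighted_score(responses):
    """responses: list of (level, is_correct) pairs, one per item in the test.

    Returns (earned, maximum): the sum of weights for correct items and the
    sum of weights for every item.
    """
    earned = sum(WEIGHTS[level] for level, ok in responses if ok)
    maximum = sum(WEIGHTS[level] for level, _ in responses)
    return earned, maximum


# Six A1 right, four C1 wrong versus six C1 right, four A1 wrong.
user_a = [("A1", True)] * 6 + [("C1", False)] * 4
user_b = [("C1", True)] * 6 + [("A1", False)] * 4

print(sum(ok for _, ok in user_a), sum(ok for _, ok in user_b))  # 6 6
print(weighted_score(user_a))  # (6, 26)
print(weighted_score(user_b))  # (30, 34)
```

Both users have a raw score of 6, but the weighted scores are 6 and 30. The maxima differ here only because the two toy item sets differ; in a real session everyone on the same test shares one maximum.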

We then look that score up in a per-blueprint table of cut scores mapping ranges to CEFR levels. Why per-blueprint? Because the difficulty profile of our Hungarian B2 items isn’t identical to that of our Spanish B2 items, even though both are nominally B2. We adjust cuts when the data tells us they’re off.
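
A sketch of what a cut-score lookup could look like. The cut values below are invented purely for illustration (they are not the real Spanish or Hungarian tables), and the `"es"`/`"hu"` keys and `level_for` name are assumptions.

```python
from bisect import bisect_right

# Hypothetical tables: each entry is (lowest weighted score for the level, level).
HYPOTHETICAL_CUTS = {
    "es": [(0, "A1"), (18, "A2"), (42, "B1"), (72, "B2"), (100, "C1")],
    "hu": [(0, "A1"), (16, "A2"), (40, "B1"), (68, "B2"), (96, "C1")],
}


def level_for(score, blueprint):
    """Map a weighted score to a CEFR level using that blueprint's cuts."""
    if score < 0:
        raise ValueError("weighted scores cannot be negative")
    cuts = HYPOTHETICAL_CUTS[blueprint]
    lower_bounds = [lower for lower, _ in cuts]
    # bisect_right finds the first cut above the score; step back one entry.
    return cuts[bisect_right(lower_bounds, score) - 1][1]


print(level_for(70, "es"))  # B1
print(level_for(70, "hu"))  # B2
```

The same weighted score maps to different levels under the two made-up tables, which is the point of keeping cuts per blueprint.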

Fuzzy boundaries

A score that lands close to a cut-score boundary triggers our fuzzy boundary flag. Instead of saying “you are B1,” the result says “you are B1, but you’re right on the edge of B2. Another day could go either way.” That’s more honest than the alternative.

We added this after our own beta testers (mostly us, taking the test in languages we know well) noticed a one-question swing sometimes flipping a level. Hiding that uncertainty in the result would have been wrong.
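
One way to express the flag, as a hedged sketch: treat a score as “on the edge” when it sits within some margin of a cut. The margin of 3 points (roughly one mid-level item), the cut values, and the `classify` name are assumptions; the post does not state the actual threshold.

```python
# Hypothetical cut table: (lowest weighted score for the level, level).
CUTS = [(0, "A1"), (18, "A2"), (42, "B1"), (72, "B2"), (100, "C1")]


def classify(score, cuts=CUTS, margin=3):
    """Return (level, neighbor); neighbor is None unless the score is near a cut.

    Assumes score >= 0 and that cuts are sorted by lower bound.
    """
    bounds = [lower for lower, _ in cuts]
    levels = [name for _, name in cuts]
    idx = max(i for i, lower in enumerate(bounds) if score >= lower)
    # Just below the next level's cut: flag the level above.
    if idx + 1 < len(bounds) and bounds[idx + 1] - score <= margin:
        return levels[idx], levels[idx + 1]
    # Just above this level's own cut: flag the level below.
    if idx > 0 and score - bounds[idx] < margin:
        return levels[idx], levels[idx - 1]
    return levels[idx], None


print(classify(70))  # ('B1', 'B2')
print(classify(55))  # ('B1', None)
```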

Per-skill subscores

Alongside the overall level, we compute a subscore for each question type: vocabulary/grammar (from cloze multiple choice), reading, listening, and production. We use the same weighted approach inside each type, so a B2 production score and a B2 listening score mean roughly the same thing.
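
The grouping step can be sketched like this. The response tuple shape and the `subscores` name are assumptions, and the sketch stops at weighted totals per type, since how each total is mapped to a level is not specified here.

```python
from collections import defaultdict

WEIGHTS = {"A1": 1, "A2": 2, "B1": 3, "B2": 4, "C1": 5}


def subscores(responses):
    """responses: iterable of (question_type, level, is_correct).

    Returns {question_type: (earned, maximum)} using the same level weights
    as the overall score.
    """
    earned = defaultdict(int)
    maximum = defaultdict(int)
    for question_type, level, ok in responses:
        maximum[question_type] += WEIGHTS[level]
        if ok:
            earned[question_type] += WEIGHTS[level]
    return {qt: (earned[qt], maximum[qt]) for qt in maximum}


session = [
    ("reading", "B1", True),
    ("reading", "B2", True),
    ("listening", "B1", True),
    ("listening", "B2", False),
]
print(subscores(session))  # {'reading': (7, 7), 'listening': (3, 7)}
```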

This is where the test stops being just a number and starts being useful. Most users have a noticeable gap: strong reading and weak listening, for example. The recommendations on the results screen lean into that profile.

Early exit

We added an early-exit option after watching enough sessions to notice the failure mode. If you’re well below the lowest level a test is calibrated for, slogging through 40 questions you can’t read is demoralizing and produces no signal we don’t already have. Once we have enough early signal that the test isn’t going to be informative for this user, we offer to wrap up and call it A1 (or below). Unanswered questions count as incorrect, so your level estimate isn’t artificially inflated by bailing out.
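
A small sketch of both halves of that behavior. The trigger rule in `should_offer_exit` (at least 8 answers with at most 1 correct) is a made-up placeholder, since the actual “enough early signal” criterion isn’t described here; the scoring half just shows unanswered items counting as incorrect.

```python
WEIGHTS = {"A1": 1, "A2": 2, "B1": 3, "B2": 4, "C1": 5}


def should_offer_exit(answers_so_far, min_items=8, max_correct=1):
    """Hypothetical trigger: enough answers seen, almost none of them correct."""
    return len(answers_so_far) >= min_items and sum(answers_so_far) <= max_correct


def score_session(levels, answers):
    """levels: target level of every item in the test, in order.
    answers: True/False per answered item, None for items never reached.

    Unanswered items earn nothing but still count toward the maximum.
    """
    earned = sum(WEIGHTS[lvl] for lvl, ans in zip(levels, answers) if ans is True)
    maximum = sum(WEIGHTS[lvl] for lvl in levels)
    return earned, maximum


levels = ["A1"] * 4 + ["A2"] * 4
answers = [True, False, False, False] + [None] * 4  # user exited after 4 items
print(score_session(levels, answers))  # (1, 12)
```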

What we’d still like to improve

Honest answer: our cut-score tables are calibrated against expert judgment and a thin layer of real-data validation. They’re not psychometrically rigorous. As we accumulate more answered-item data, we want to revisit cuts using item response theory, modeling each item’s difficulty and discrimination empirically. That’s on the roadmap, not in the build.
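
For readers unfamiliar with item response theory, the standard two-parameter logistic (2PL) model is one common way to give each item a difficulty and a discrimination. The sketch below shows only that textbook formula; it is not a statement about which IRT model will eventually be used, and the parameter names are the conventional ones rather than anything from the codebase.

```python
import math


def p_correct(theta, discrimination, difficulty):
    """2PL model: probability that a user of ability theta answers correctly.

    difficulty shifts the curve along the ability axis; discrimination
    controls how sharply the probability rises around that point.
    """
    return 1.0 / (1.0 + math.exp(-discrimination * (theta - difficulty)))


# When ability equals difficulty, the probability is exactly one half.
print(p_correct(theta=0.0, discrimination=1.2, difficulty=0.0))  # 0.5
```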

For now: 40 items, weighted scoring, hand-tuned cuts, fuzzy boundaries, per-skill breakdown. That’s the model. If a result ever feels off, the Report Issue button on the results screen is the fastest way to tell us.