Developer log

Why we threw out our QA harness and rebuilt from scratch

Published by The Language Level Check team

This is the kind of post you write the day after you do something you wish you hadn’t put off as long as you did.

We spent yesterday deleting the test-content QA infrastructure that had grown up around our blueprints over the last two months. Replaced every blueprint with a placeholder. Archived the lot to a long-form notes branch. Today we’re restarting the content pass with a clean slate.

This post is about why.

What we had

The original QA system was a Python toolkit that ran a battery of checks over every blueprint:

  • Schema validation (the easy part).
  • Per-item answer validation: run the production scoring engine against the blueprint’s authored answer, flag if the engine doesn’t accept it.
  • Per-language consistency checks: do the CEFR weights sum to the documented max possible score, do the cut scores cover the score range without gaps.
  • Distractor analysis: look at multiple-choice items and flag ones where one option was obviously trap-y, where two were near-duplicates, where the correct answer was the shortest or the longest by a suspicious margin.
  • A “phonetic trap” detector for listening items that flagged any minimal pairs we hadn’t deliberately authored as such.
  • Auto-fixes for a handful of common issues, run as part of the same pipeline.
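
To make the per-language consistency check concrete, here is a minimal sketch of the idea, not the archived toolkit’s code. The `CutBand` type, the `consistency_problems` function, and the sample numbers are all hypothetical. Only the A1=1 through C1=5 weights come from the scoring write-up further down this log.

```python
from dataclasses import dataclass

# Default level weights described in the scoring post; everything else is illustrative.
LEVEL_WEIGHTS = {"A1": 1, "A2": 2, "B1": 3, "B2": 4, "C1": 5}


@dataclass
class CutBand:
    level: str
    low: int   # lowest weighted score in the band (inclusive)
    high: int  # highest weighted score in the band (inclusive)


def consistency_problems(item_levels, documented_max, bands):
    """Return a list of human-readable problems; an empty list means consistent."""
    problems = []

    # Check 1: the item weights must sum to the documented maximum score.
    actual_max = sum(LEVEL_WEIGHTS[level] for level in item_levels)
    if actual_max != documented_max:
        problems.append(
            f"weights sum to {actual_max}, documented max is {documented_max}"
        )

    # Check 2: the cut bands must cover 0..documented_max with no gaps or overlaps.
    expected_low = 0
    for band in sorted(bands, key=lambda b: b.low):
        if band.low > expected_low:
            problems.append(
                f"gap: scores {expected_low}-{band.low - 1} map to no level"
            )
        elif band.low < expected_low:
            problems.append(
                f"overlap: {band.level} starts at {band.low}, expected {expected_low}"
            )
        expected_low = max(expected_low, band.high + 1)
    if expected_low <= documented_max:
        problems.append(
            f"gap: scores {expected_low}-{documented_max} map to no level"
        )
    return problems


# Toy blueprint: one item per level, so the maximum is 1+2+3+4+5 = 15.
levels = ["A1", "A2", "B1", "B2", "C1"]
good_bands = [
    CutBand("A1", 0, 3), CutBand("A2", 4, 6), CutBand("B1", 7, 9),
    CutBand("B2", 10, 12), CutBand("C1", 13, 15),
]
bands_missing_b1 = [b for b in good_bands if b.level != "B1"]
```

A check like this returns findings rather than printing or fixing anything, which keeps it easy to read and easy to delete.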

It was honestly a really nice system. The problem was that we kept making it nicer.

How it went wrong

The QA toolkit was easier to extend than the blueprints were. Adding a new check was a few lines of Python. Fixing the things the check flagged across 37 languages was hours of manual work. So the queue grew.

By month two, the toolkit had 14 different checks, most with at least one false positive per language. The reports were so noisy we’d effectively stopped reading them. We had a script that applied the auto-fixes unattended, and those fixes had their own subtle bugs, which we then patched around with more checks. The thing that was supposed to ensure quality had become a thing we were maintaining instead of using.

The real signal something was wrong: when a content question came up (“is this Hungarian B2 item actually B2?”), nobody on the team thought to run the QA toolkit. We just opened the blueprint and read it.

The decision

We could have kept patching. The right call was to stop, archive the work for reference, and rebuild what we actually needed.

What we actually need:

  1. Schema validation. Cheap, useful, mandatory. The new pipeline runs this on every commit.
  2. A blueprint-coverage test suite. Drive every blueprint through the production scoring engine with every authored answer, and assert that the engine accepts what the blueprint says is correct. This is the only test we trust automatically, because it uses the production code path, not a parallel re-implementation.
  3. Human review for everything else. Distractor quality, level appropriateness, listening transcript accuracy. These aren’t checks we’re willing to automate away again.
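
The coverage suite in item 2 can be sketched roughly as follows. This is an illustration under assumptions: the blueprint dictionary layout, `coverage_failures`, and `toy_engine_accepts` are hypothetical names, and the toy function only stands in for the production scoring engine, which a real suite would import and call directly.

```python
def coverage_failures(blueprints, engine_accepts):
    """Return (blueprint_id, item_id, answer) for every authored answer the engine rejects."""
    failures = []
    for blueprint in blueprints:
        for item in blueprint["items"]:
            for answer in item["authored_answers"]:
                if not engine_accepts(item, answer):
                    failures.append((blueprint["id"], item["id"], answer))
    return failures


def toy_engine_accepts(item, answer):
    # Placeholder only: a real suite would call the production scoring engine here,
    # so the test exercises the same code path users hit.
    return answer.strip().lower() in item["accepted"]


# Hypothetical blueprint data: q2's authored answer is missing its accent.
blueprints = [
    {
        "id": "es-demo",
        "items": [
            {"id": "q1", "accepted": {"hola"}, "authored_answers": ["Hola"]},
            {"id": "q2", "accepted": {"adiós"}, "authored_answers": ["adios"]},
        ],
    }
]
failures = coverage_failures(blueprints, toy_engine_accepts)
```

Passing the engine in as a callable is a design choice for the sketch: the loop never re-implements scoring, it only reports where the blueprint and the engine disagree.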

The placeholder blueprints we put in place today are minimal scaffolds that pass schema validation but contain no real content. We’re rebuilding from the language-data side, language by language, with the new coverage suite as the only automated gate.

What we learned

Three things worth writing down so we don’t repeat them:

  • An automated check is only worth keeping if you actually look at its output. If you have to scroll past 200 false positives to find the real one, the check is doing harm.
  • Auto-fixers should be separate processes from auto-detectors. Running them together means a false-positive detection turns into a wrong-fix commit. Keeping them split would have made the damage recoverable.
  • The temptation to add another check is much stronger than the temptation to delete an old one. We should have been deleting at the same rate we were adding from week one. We weren’t.
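
The detector/fixer split in the second point might look something like this. It is a sketch only: the `Finding` type, the whitespace check, and the `approved` flag are invented for illustration and are not taken from the archived toolkit.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Finding:
    item_id: str
    check: str
    proposed_fix: str
    approved: bool = False  # set by a human reviewer, never by the detector


def detect(items):
    """Detector: reports findings and never mutates the content."""
    return [
        Finding(item_id, "stray-whitespace", text.strip())
        for item_id, text in items.items()
        if text != text.strip()
    ]


def apply_fixes(items, findings):
    """Fixer: a separate step that applies only findings someone has approved."""
    fixed = dict(items)
    for finding in findings:
        if finding.approved:
            fixed[finding.item_id] = finding.proposed_fix
    return fixed


items = {"q1": "hola ", "q2": "adios"}
findings = detect(items)

# Nothing is approved yet, so running the fixer changes nothing.
untouched = apply_fixes(items, findings)

# After review, the approved finding is applied to a copy of the content.
reviewed = [replace(f, approved=True) for f in findings]
fixed = apply_fixes(items, reviewed)
```

With the two steps separated, a false positive costs a rejected finding instead of a wrong commit.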

This rebuild costs us some perceived progress in the short term. We expect to make it back fast. The next stretch of content work won’t be slowed down by the toolkit that was supposed to be helping.

Production-ready: 37 languages, real audio, and a curated study path

Published by The Language Level Check team

This week we cut what we’re calling the first production-ready build. The earlier release was production in the sense that we shipped it. This one is production in the sense that it does the thing it claims to do, end to end, in every language we ship.

Three things came together for this release: regenerated audio across every listening item, a full pass of hand-vetted learning resources for every language and level, and a redesign of the post-test recommendations so users actually do something with their score.

Audio: every listening item, regenerated

The earlier listening items came out of a TTS pipeline we now consider provisional. Quality varied. Some languages were excellent. Some were stilted. A handful had pronunciation errors that crept past our spot checks. For this release we regenerated everything through a single canonical pipeline and did a per-language listening pass on the output.

We also wrote a transcript fallback for every listening item. Two reasons. Accessibility (some users prefer to read). Reliability (audio is the part of the test most likely to misbehave on a flaky network).

663 learning resources, hand-vetted

This is the work we’re most proud of in this release. Every CEFR level in every supported language now has a curated set of learning resources. Textbooks, podcasts, structured courses, reference sites. Each one hand-checked against the level it claims to serve. The 663 number is the total across all languages and levels.

We use these on the results screen to give you a concrete next step. “You scored at B1 Spanish, here are five things people at B1 Spanish tend to find useful.” Small piece of the test experience, but in our internal sessions it changed the result from feeling like a verdict to feeling like a starting point.

A note on book links: if you click through to a recommended book and buy it, we may earn a small affiliate commission. We chose books we’d recommend anyway, and the commission applies whether you buy the book we linked to or anything else in that browsing session. But you should know it’s there.

Redesigned recommendations

The original recommendations section was small, secondary, and easy to scroll past. After watching how few users engaged with it, we redesigned it. Bigger cards. Clearer per-resource descriptions. A visual hierarchy that puts recommended resources roughly on par with the score itself.

Engagement with resources rose substantially after the redesign. It’s the kind of change that’s invisible in code review but visible in user behavior.

What didn’t make this release

Two things we expected to ship slipped:

  • Custom blueprint difficulty curves per language. Still using the default A1=1, A2=2, …, C1=5 weighting for every blueprint. We have data suggesting we should bias the curve harder in some languages, but we want more sessions before we touch this.
  • Mid-test difficulty adaptation. We considered an adaptive variant where questions get easier or harder based on how you’re doing, and decided against it. Doing adaptive testing well takes significant calibration work, and a fixed-form test is much easier to interpret. We may revisit it once we have item response theory in place.

What we learned this week

The audio regeneration touched 144 files across 37 languages. We learned (again) that you can’t trust a single spot-check per language. Three different listeners per language is the minimum for catching the wrong-stress and wrong-accent errors TTS produces. We’ve written that into our content QA checklist.

Next release we want to widen language coverage to 46. After that, a hard look at cut scores using the data we’ve now started accumulating.

Inside the scoring engine: how 40 questions land your CEFR within half a level

Published by The Language Level Check team

Now that real test sessions are rolling in, we want to write up how the scoring actually works. We get questions about this fairly often. Usually some variant of “I got a B1 last week and a B2 this week, what gives?” This post is the long answer.

The unit of scoring is an item, not a question

Each test has 40 items. An item is one question (multiple choice, reading comp, listening, or a typed/reorder production task) plus metadata that describes:

  • Target CEFR level for that item: A1 through C1.
  • Question type, used both for routing to the right UI and for the per-skill subscore breakdown.
  • Construct: what the item is testing (subject-verb agreement, inferential reading, weak vowel reduction, etc.). We use this for diagnostic notes more than for scoring.

A blueprint specifies the distribution. Roughly even across A1–C1 for most languages, slightly weighted toward the middle since that’s where most users land and where discrimination matters most.

Weighted, not raw

Naive scoring gives one point per correct answer. The problem: a user who gets six A1 questions right and four C1 wrong scores the same as one who gets six C1 right and four A1 wrong. Those two aren’t at the same level.

Our model assigns each CEFR level a weight. The defaults that most blueprints inherit:

  • A1 = 1, A2 = 2, B1 = 3, B2 = 4, C1 = 5

Your weighted score is the sum of weights for items you got right. The maximum is the sum of weights for every item in the test.
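
As a worked example of the arithmetic, here is a small sketch using the default weights. The `weighted_score` function and the response format are assumptions made for illustration, not the engine’s actual interface.

```python
# Default weights from the list above.
LEVEL_WEIGHTS = {"A1": 1, "A2": 2, "B1": 3, "B2": 4, "C1": 5}


def weighted_score(responses):
    """responses: iterable of (target_level, is_correct). Returns (score, maximum)."""
    responses = list(responses)
    score = sum(LEVEL_WEIGHTS[level] for level, correct in responses if correct)
    maximum = sum(LEVEL_WEIGHTS[level] for level, _ in responses)
    return score, maximum


# The two users from the naive-scoring example: both have six correct answers.
six_a1_right = [("A1", True)] * 6 + [("C1", False)] * 4
six_c1_right = [("C1", True)] * 6 + [("A1", False)] * 4

print(weighted_score(six_a1_right))  # (6, 26)
print(weighted_score(six_c1_right))  # (30, 34)
```

Raw scoring gives both users 6 points. Weighted scoring gives 6 and 30, which is the separation the naive model misses.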

We then look that score up in a per-blueprint table of cut scores mapping ranges to CEFR levels. Why per-blueprint? Because the difficulty profile of our Hungarian B2 items isn’t identical to our Spanish B2 items, even though both are nominally B2. We adjust cuts when the data tells us they’re off.
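
The lookup itself can be sketched as a search over sorted lower bounds. The cut values below are invented for illustration (they assume a 40-item test with a maximum of 120), and `level_for` is a hypothetical name. Real tables are tuned per blueprint.

```python
from bisect import bisect_right

# Hypothetical cut table for one blueprint: (minimum weighted score, level).
CUTS = [(0, "A1"), (20, "A2"), (45, "B1"), (75, "B2"), (100, "C1")]


def level_for(score, cuts=CUTS):
    """Map a weighted score to the level whose band contains it."""
    if score < cuts[0][0]:
        raise ValueError(f"score {score} is below the lowest cut")
    lower_bounds = [low for low, _ in cuts]
    # bisect_right finds the first cut strictly above the score;
    # the band we want is the one just before it.
    index = bisect_right(lower_bounds, score) - 1
    return cuts[index][1]
```

With these made-up cuts, a score of 44 maps to A2 and a score of 45 maps to B1, so a single point can move the reported level.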

Fuzzy boundaries

A score that lands close to a cut-score boundary triggers our fuzzy boundary flag. Instead of saying “you are B1,” the result says “you are B1, but you’re right on the edge of B2. Another day could go either way.” That’s more honest than the alternative.

We added this after our own beta testers (mostly us, taking the test in languages we know well) noticed a one-question swing sometimes flipping a level. Hiding that uncertainty in the result would have been wrong.
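
One way to express the flag is a margin around each cut. This is a sketch under assumptions: the cut table, the 3-point margin, and the `describe` function are all hypothetical, and the real engine may define “close” differently.

```python
from bisect import bisect_right

# Hypothetical cut table: (minimum weighted score, level).
CUTS = [(0, "A1"), (20, "A2"), (45, "B1"), (75, "B2"), (100, "C1")]


def describe(score, cuts=CUTS, margin=3):
    """Return (level, neighbor), where neighbor is the adjacent level the score
    is within `margin` points of, or None if the score sits clear of any cut."""
    if score < cuts[0][0]:
        raise ValueError(f"score {score} is below the lowest cut")
    lower_bounds = [low for low, _ in cuts]
    i = bisect_right(lower_bounds, score) - 1
    level = cuts[i][1]

    # Just short of the next level up?
    if i + 1 < len(cuts) and cuts[i + 1][0] - score <= margin:
        return level, cuts[i + 1][1]
    # Only just cleared the cut from the level below?
    if i > 0 and score - cuts[i][0] < margin:
        return level, cuts[i - 1][1]
    return level, None
```

Under these assumptions `describe(73)` yields B1 with B2 as the neighbor, which corresponds to the “right on the edge of B2” message, while `describe(60)` yields B1 with no flag.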

Per-skill subscores

Alongside the overall level, we compute a subscore for each question type. Vocabulary/grammar (from cloze MC), reading, listening, production. Same weighted approach inside each type, so a B2 production score and a B2 listening score mean roughly the same thing.

This is where the test stops being just a number and starts being useful. Most users have a noticeable gap. Strong reading, weak listening, for example. The recommendations on the results screen lean into that profile.
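
The per-skill breakdown is the same weighted sum, grouped by question type. A minimal sketch, assuming a hypothetical response format of (question type, target level, correct) and a made-up `subscores` function:

```python
from collections import defaultdict

# Default weights from the scoring model above.
LEVEL_WEIGHTS = {"A1": 1, "A2": 2, "B1": 3, "B2": 4, "C1": 5}


def subscores(responses):
    """responses: iterable of (question_type, target_level, is_correct).
    Returns {question_type: (weighted_score, weighted_maximum)}."""
    totals = defaultdict(lambda: [0, 0])
    for question_type, level, correct in responses:
        weight = LEVEL_WEIGHTS[level]
        totals[question_type][1] += weight  # every item counts toward the maximum
        if correct:
            totals[question_type][0] += weight
    return {qtype: (score, maximum) for qtype, (score, maximum) in totals.items()}


# Strong reading, weaker listening.
sample = [
    ("reading", "B1", True),
    ("reading", "B2", True),
    ("listening", "B1", True),
    ("listening", "B2", False),
]
profile = subscores(sample)
```

Here reading comes out at 7 of 7 and listening at 3 of 7, the kind of gap the recommendations lean into.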

Early exit

We added an early-exit option after watching enough sessions to notice the failure mode. If you’re well below the lowest level a test is calibrated for, slogging through 40 questions you can’t read is demoralizing and produces no signal we don’t already have. Once we have enough early signal that the test isn’t going to be informative for this user, we offer to wrap up and call it A1 (or below). Unanswered questions count as incorrect, so your level estimate isn’t artificially inflated by bailing out.
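
The “unanswered counts as incorrect” rule is easy to state in code. This sketch assumes a hypothetical data shape (items as id and level pairs, answers as a dictionary holding only the questions the user reached). It does not show how the engine decides when to offer the exit.

```python
# Default weights from the scoring model above.
LEVEL_WEIGHTS = {"A1": 1, "A2": 2, "B1": 3, "B2": 4, "C1": 5}


def score_with_unanswered(items, answers):
    """items: list of (item_id, target_level) for the whole test.
    answers: {item_id: is_correct} for answered items only.
    Returns (score, maximum); anything missing from `answers` scores zero."""
    score = sum(
        LEVEL_WEIGHTS[level]
        for item_id, level in items
        if answers.get(item_id, False)
    )
    # The maximum always covers the full test, answered or not.
    maximum = sum(LEVEL_WEIGHTS[level] for _, level in items)
    return score, maximum


items = [("q1", "A1"), ("q2", "A2"), ("q3", "B1")]
answered_before_exit = {"q1": True}  # user wrapped up after the first question
```

Because the maximum is computed over every item, leaving early can only lower the score relative to the full test, never raise it.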

What we’d still like to improve

Honest answer: our cut-score tables are calibrated against expert judgment and a thin layer of real-data validation. They’re not psychometrically rigorous. As we accumulate more answered-item data, we want to revisit cuts using item response theory. Model each item’s difficulty and discrimination empirically. That’s on the roadmap, not in the build.

For now: 40 items, weighted scoring, hand-tuned cuts, fuzzy boundaries, per-skill breakdown. That’s the model. If a result ever feels off, the Report Issue button on the results screen is the fastest way to tell us.

How we got here: launching Language Level Check

Published by The Language Level Check team

There’s no shortage of language tests on the internet. Official ones take a week to schedule, cost real money, and tell you a level you’ll only doubt later. Short quizzes scattered across language-learning blogs fit on a single page and feel about as rigorous as a magazine personality quiz. We wanted something in between. Short enough that you’d actually take it. Structured enough that the result meant something the next morning.

That’s what we’ve been working on for the last few months, and today we’re turning it loose. Language Level Check is an iOS app that gives you an estimated CEFR level (or JLPT, HSK, or TOPIK, depending on the language) after about ten minutes of answering 40 questions.

What we wanted to fix

Three things bother us about most online level tests.

First, they’re not calibrated. A question is just a question. There’s no concept of how hard it is or how diagnostic it is at a given level. You get a number out, but the number is more vibe than measurement.

Second, they only test one thing. Usually vocabulary, sometimes grammar, almost never reading or listening. Real proficiency is a profile across skills.

Third, they hand you a number and leave. No “here’s what that level means, here’s what to study next.”

Our approach weights each item by the CEFR level it was authored for (a correct B2 answer counts for more than a correct A1 answer), scores using classical test theory with cut-score boundaries, and produces a result that includes per-skill subscores and concrete next steps.

It’s not as rigorous as a proctored Cambridge exam. We’re not pretending otherwise. It’s the test you take before you decide whether to pay for one of those, or before you tell a tutor where to start.

What’s in the box on day one

  • 37 languages, with more on the way. Every test is hand-authored against a published proficiency framework. CEFR for most. JLPT for Japanese, HSK for Mandarin, TOPIK for Korean.
  • Four question types: fill-in-the-blank multiple choice, reading comprehension, listening (with transcripts as a fallback), and short constrained production (typed answers and word-tile reordering).
  • A weighted scoring engine that produces a CEFR estimate, a per-skill breakdown, and a fuzzy-boundary indicator when your score sits near the edge between two levels.
  • A Report Issue button on every question, because the content’s going to need real-world tuning. The fastest way to find a bad distractor is to ask the person who just took the test.
  • No account required. Your first test is free. A one-time $1.99 unlocks all 37 languages forever. We’re a small team. The product needs to pay for itself without chasing subscription renewals.

What we’re not doing

We’re not building a learning app. We’re not going to teach you Spanish. There are very good apps for that. We’re the diagnostic step before and during. The thing you take when you want to know where you are and what to work on next.

We’ll write here regularly as we add languages, refine the scoring, and learn from real-world test sessions. Thanks for reading. If you try the app, we’d love to hear what we got wrong.