Developer log

Hello, Android: same scoring engine, native shell

Posted by The Language Level Check team

Today we shipped the Android version. From the user’s perspective it’s the same product as the iOS app. Pick a language, take 40 questions, get your CEFR level and a per-skill breakdown. Under the hood it’s a parallel implementation in Kotlin with Jetpack Compose, sharing nothing with the iOS code except the test content itself.

This post is about how we decided what to share and what to rewrite.

What’s shared

One thing, and it’s the important one: the test content. Both apps load the same versioned blueprint JSON files from the same content endpoint. The blueprints describe items, scoring weights, cut scores, and accept lists in a platform-agnostic way. Adding a question or fixing a typo updates both apps simultaneously.

That meant our content pipeline had to be platform-agnostic from the beginning, which forced us to keep the blueprint schema clean. In retrospect this is the architectural decision we’re happiest about. If the apps had drifted to different content formats, we’d be writing every fix twice forever.

What we rewrote

Everything else. Scoring engine, question runner, language picker, results screen, share sheet, settings, analytics integration. All native Kotlin running through Compose.

We considered Kotlin Multiplatform Mobile (KMM) for the scoring engine. Decided against it for three reasons.

The scoring engine is small. A few hundred lines on each platform. Sharing it would have cost us more in build complexity than the duplication does in maintenance.

The scoring logic is also load-bearing. If anything goes wrong, we want to debug it in the language we’re reading. Two clear native implementations beat one shared implementation we have to reason about through a build-system layer.

We have a blueprint-coverage test suite on each platform that drives every blueprint through the production scoring code with every authored answer. If the two engines ever diverge in behavior, the test suite tells us immediately.

So instead of sharing the code, we share the test contract. Both apps run the same coverage suite. As long as both pass, they score the same way for the same input.
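To make the idea of a shared test contract concrete, here is a minimal sketch in TypeScript (chosen only for illustration; the real suites are native on each platform). The `CoverageBlueprint` shape, `scoreAnswer`, and `runCoverage` are hypothetical names, and the stand-in scorer is far simpler than a production engine.

```typescript
// Hypothetical shapes: a real blueprint carries many more fields.
interface CoverageItem {
  id: string;
  weight: number;
  acceptList: string[];
  // Every authored answer, paired with the score its author expects.
  authoredAnswers: { answer: string; expectedScore: number }[];
}

interface CoverageBlueprint {
  language: string;
  items: CoverageItem[];
}

// Stand-in for the production scoring function under test (an assumption).
function scoreAnswer(item: CoverageItem, answer: string): number {
  return item.acceptList.includes(answer.trim().toLowerCase()) ? item.weight : 0;
}

// Drive every authored answer of every item through the scorer
// and collect a readable message for each mismatch.
function runCoverage(
  blueprints: CoverageBlueprint[],
  score: (item: CoverageItem, answer: string) => number = scoreAnswer,
): string[] {
  const failures: string[] = [];
  for (const bp of blueprints) {
    for (const item of bp.items) {
      for (const { answer, expectedScore } of item.authoredAnswers) {
        const actual = score(item, answer);
        if (actual !== expectedScore) {
          failures.push(
            `${bp.language}/${item.id}: "${answer}" scored ${actual}, expected ${expectedScore}`,
          );
        }
      }
    }
  }
  return failures;
}
```

If each platform runs the equivalent loop over the same blueprint files and both report an empty failure list, the two engines agree on every authored answer.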

What surprised us

  • String localization on Android is more forgiving than on iOS. Android’s strings.xml, with its <plurals> resource, handled plural rules without fuss, whereas it took us multiple iterations to get them right on iOS using String(localized:). We’d been ready for the opposite.
  • EncryptedSharedPreferences caused trouble. We initially stored the device ID through EncryptedSharedPreferences to be safe. It turned out to add startup overhead, runtime crashes on some OEM Android variants, and a dependency we weren’t sure would be supported long-term. We moved the device ID to plain SharedPreferences for the v1.0.1 patch. The device ID isn’t sensitive and didn’t warrant the complexity.
  • The Play Store listing localizes more aggressively than the App Store one does. We wrote our store description once in English. Play’s review process automatically suggests translated variants for the languages where we publish. The iOS metadata, by contrast, is whatever you upload through your release tooling.
  • First-run crash reports were dominated by emulators and a single OEM. Higher volume of emulator sessions than we’d expected. The OEM in question (we won’t name it) has its own version of WebView that misbehaves with one of our libraries. We worked around it for the v1.0.1 patch.

What’s the same as iOS, on purpose

Some things we deliberately kept identical to the iOS app even though Android idiom would have suggested otherwise:

  • The question runner UI uses the same layout structure on both platforms. There’s a version of this app where we leaned into Material 3 more aggressively, but it would have meant the iOS and Android tests felt different, and a CEFR estimate is supposed to be reproducible across platforms.
  • The early-exit logic is identical, same trigger conditions on both platforms.
  • The fuzzy-boundary indicator on the results screen reads the same and triggers under the same conditions.
  • The Report Issue button is in exactly the same place. We’ll rename it on Android to “Report problem” eventually, but for v1.0 we kept the wording identical for support consistency.

What’s next

The first Android-specific patch is already in flight: the v1.0.1 we mentioned above, with the encrypted-prefs fix, the WebView workaround, and a couple of layout tweaks for foldables. After that we want to focus on Play Store listing quality. There’s a real gap between our App Store and Play Store install rates, and we suspect a good chunk of it is conversion on the store page itself.

Two platforms, one product, one content pipeline. Thanks for installing.

Localizing the marketing site into 38 languages — and what we deliberately didn't translate

Posted by The Language Level Check team

The marketing site shipped initially in English. Deliberate “do the minimum, watch the data” choice. We wanted to know whether non-English traffic was finding us at all before committing to the work of localizing.

The data was clear enough. A significant majority of our web traffic came from non-English-speaking countries, primarily landing on language-specific pages via search for queries in their native language. Search engines were doing a great job of finding us. The site was then making those visitors read English. That’s a translation gap, not a traffic gap.

So this week we shipped a redesign that includes:

  • Full localization of the UI chrome into 38 locales, including the few regional variants (es-MX, pt-PT, en-GB, en-AU) that didn’t end up in the final language picker but exist in the i18n tables for completeness.
  • Locale-prefixed URLs for every page (/ja/languages/japanese, /de/faq, etc.), with proper hreflang metadata so search engines understand the relationship.
  • Localized app screenshots. We generated screenshots in every locale, so the Japanese version of the site shows the Japanese version of the app. The automated capture run across all 38 locales took most of a day.
  • Localized language-name search. When a German user types “Japanisch” into the language picker, they should find Japanese. We built a search index keyed by the localized language name in every supported locale.
  • A small but cute touch: when a Japanese user lands on the homepage, the hero image shows a French test question (because Japanese users testing themselves on a foreign language are probably not testing themselves on Japanese). The mapping is deliberate per locale.
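The localized language-name search in the list above could be sketched as follows. This is a minimal illustration, not the site's actual index: the `localizedLanguageNames` table holds made-up sample data, and the diacritic-stripping prefix match is an assumption about how forgiving the search should be.

```typescript
// Hypothetical sample data: names of test languages, keyed by UI locale.
const localizedLanguageNames: Record<string, Record<string, string>> = {
  en: { ja: "Japanese", de: "German", fr: "French" },
  de: { ja: "Japanisch", de: "Deutsch", fr: "Französisch" },
  ja: { ja: "日本語", de: "ドイツ語", fr: "フランス語" },
};

// Lowercase and strip Latin combining marks so "franzosisch" matches "Französisch".
function normalizeQuery(text: string): string {
  return text.normalize("NFD").replace(/[\u0300-\u036f]/g, "").toLowerCase().trim();
}

// Build one flat index: normalized localized name -> language code.
function buildLanguageIndex(
  names: Record<string, Record<string, string>>,
): Map<string, string> {
  const index = new Map<string, string>();
  for (const uiLocale of Object.keys(names)) {
    const perLocale = names[uiLocale];
    for (const code of Object.keys(perLocale)) {
      index.set(normalizeQuery(perLocale[code]), code);
    }
  }
  return index;
}

// Return the codes of all languages whose name, in any locale, starts with the query.
function searchLanguages(index: Map<string, string>, query: string): string[] {
  const q = normalizeQuery(query);
  if (q === "") return [];
  const hits: string[] = [];
  index.forEach((code, label) => {
    if (label.startsWith(q) && !hits.includes(code)) hits.push(code);
  });
  return hits;
}
```

Because the index is keyed by the name in every locale, a visitor typing “Japanisch” finds Japanese regardless of which locale the page itself is rendered in.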

What we deliberately did not translate

Two things we left untranslated:

  • The test content itself. Test prompts, options, and reading passages stay in the target language. That’s the entire point of the test. A Spanish A2 question doesn’t become more accessible in Japanese. It stops being a Spanish A2 question.
  • This blog. Posts here are written by our team in English. Translating every post into 33 languages would either slow our release cadence to a crawl or produce machine-translated content we couldn’t quality-check. We added a small banner on non-English versions of the blog explaining this.

The general principle: localize anything that helps a user decide whether to use the product. Don’t localize anything where the translation would be worse than the original.

What we learned doing i18n on a static site

We built the localization as a thin layer over our existing static-site setup rather than pulling in a third-party i18n framework. A few things that worked well, a few that didn’t.

Worked well:

  • A single TypeScript module with one translation record per locale, accessed through a small useTranslations(locale) helper. We considered per-locale JSON files but the single module gives us autocomplete and type-checking for free.
  • Locale routing driven off a supported-locales list, with each page generating its variants at build time. Adding a locale is roughly one line.
  • Trailing-slash off across the board. Search engines do not need both /languages/japanese and /languages/japanese/ indexed.
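A minimal sketch of the pattern described in the list above, under stated assumptions: the translation keys, the three sample locales, the `localizedPaths` and `hreflangLinks` helpers, and the `https://example.com` origin are all placeholders, and the sketch assumes every locale (English included) gets a URL prefix.

```typescript
// Adding a locale means adding it here and supplying its record below.
const supportedLocales = ["en", "de", "ja"] as const;
type SiteLocale = (typeof supportedLocales)[number];

// The English record defines the keys every other locale must provide.
const en = {
  "nav.languages": "Languages",
  "nav.faq": "FAQ",
};
type TranslationKey = keyof typeof en;

// One record per locale; a missing key or locale is a compile-time error.
const translations: Record<SiteLocale, Record<TranslationKey, string>> = {
  en,
  de: { "nav.languages": "Sprachen", "nav.faq": "FAQ" },
  ja: { "nav.languages": "言語", "nav.faq": "よくある質問" },
};

function useTranslations(locale: SiteLocale): (key: TranslationKey) => string {
  const table = translations[locale];
  return (key) => table[key];
}

// Expand one page path into a locale-prefixed variant per locale, no trailing slash.
function localizedPaths(path: string): string[] {
  const clean = path.replace(/\/+$/, "");
  return supportedLocales.map((locale) => `/${locale}${clean}`);
}

const siteOrigin = "https://example.com"; // placeholder origin

// Emit one hreflang alternate link per locale for a page.
function hreflangLinks(path: string): string[] {
  const clean = path.replace(/\/+$/, "");
  return supportedLocales.map(
    (locale) =>
      `<link rel="alternate" hreflang="${locale}" href="${siteOrigin}/${locale}${clean}">`,
  );
}
```

Typing the key set off the English record is what gives the autocomplete and type-checking: `useTranslations("de")("nav.faqs")` would fail to compile rather than render an empty string.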

Didn’t work well:

  • We let regional variant locales (es-MX, fr-CA) live in the i18n tables for a while before realizing nothing in the UI surfaced them. They’re still there for the future, but we removed them from the language picker because they were confusing users who couldn’t find their language.
  • We had a brief period where the Japanese version of the hero image was the Japanese question screenshot. Which makes no sense. Japanese users would want to see a non-Japanese test screen, because they’re testing themselves on a language other than their own. We fixed this with a per-locale screenshot mapping.

What the data showed afterwards

A few weeks after the localized site shipped, our top entry pages started including locale-prefixed paths (/ru/languages/russian, for example) for the first time. Search engines are now ranking the localized versions for native-language queries. The canonical English versions still rank higher overall in many cases, but the localized share is climbing.

The biggest remaining gap is China, which sends us a surprising volume of traffic but where our App Store routing was, until recently, sending desktop visitors to the US store. That’s a separate post.

Voicing 47 languages: lessons from generating hundreds of TTS files

Posted by The Language Level Check team

We just finished generating the v2 audio across all 47 supported languages. We had a few false starts on the way, and we thought it’d be useful to write up what we settled on and what we learned.

Why this is hard

Listening items are the audio side of the test. Each one is a short utterance (rarely longer than 10 seconds) paired with a comprehension question. The audio has to:

  • Sound like a native speaker, in a natural register, at the right CEFR-level reading speed.
  • Pronounce vocabulary correctly. Especially the words the comprehension question hinges on.
  • Land cleanly. No weird breath at the end, no truncated phonemes, no clicks.
  • Be reproducible. When we add or fix an item, we need to regenerate that specific clip without re-rendering everything else.

The blocking constraint: no single TTS provider produces uniformly good output across 47 languages. Some are excellent at English and German, weak at Thai or Swahili. Some have a native-quality voice for Mandarin but only a robotic one for Cantonese. Multilingual coverage at this scale is, today, a fragmented landscape.

What we settled on

After evaluating several providers, we landed on one as our primary, with a fallback path for the handful of languages it doesn’t yet support well. Its strengths matched our needs: natural prosody, consistent quality across most of our supported languages, reasonable cost at our scale.

The languages where our primary provider doesn’t yet do a great job are the bottleneck for adding new ones. A handful of locales currently render through a fallback path that produces audio we consider acceptable but not great. We’ve made a deliberate decision to defer expanding to those new languages until the audio quality matches what we ship elsewhere, rather than ship listening items we’re not proud of.

The breath problem

Here’s something nobody warned us about. Most TTS providers, given a short utterance, generate a small inhalation at the start or a release of breath at the end. Sometimes both. On a 4-second clip, a 0.3-second breath is 7.5% of the total runtime, and it sounds wrong. It pulls the listener out of the moment and makes the clip feel synthetic.

We wrote a post-processing pass that:

  1. Trims silence from the start and end of every clip to a small fixed pad.
  2. Cross-fades the very beginning and end to suppress click artifacts.
  3. Normalizes peak loudness to a target so no language clip is dramatically louder or quieter than another.
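The three steps above can be sketched on raw mono samples as follows. This is an illustration under assumptions, not the production pass: the option names and the amplitude-threshold silence detection are hypothetical, the fade is a simple linear ramp at each end, and a breath is only trimmed if it falls below the chosen threshold.

```typescript
// Assumed parameters; the real pad, fade, and target values are not stated here.
interface CleanupOptions {
  sampleRate: number;       // samples per second
  silenceThreshold: number; // absolute amplitude below which a sample counts as silence
  padSeconds: number;       // silence kept on each side after trimming
  fadeSeconds: number;      // length of the fade-in and fade-out
  targetPeak: number;       // peak amplitude after normalization, e.g. 0.9
}

function cleanClip(samples: Float32Array, opts: CleanupOptions): Float32Array {
  // 1. Find the first and last samples above the silence threshold.
  let first = 0;
  while (first < samples.length && Math.abs(samples[first]) < opts.silenceThreshold) first++;
  if (first === samples.length) return new Float32Array(0); // clip is entirely silent
  let last = samples.length - 1;
  while (last > first && Math.abs(samples[last]) < opts.silenceThreshold) last--;

  // Keep a small fixed pad on each side, clamped to the clip bounds.
  const pad = Math.round(opts.padSeconds * opts.sampleRate);
  const start = Math.max(0, first - pad);
  const end = Math.min(samples.length, last + 1 + pad);
  const out = samples.slice(start, end); // copy, so the input is left untouched

  // 2. Linear fade-in and fade-out to suppress clicks at the cut points.
  const fade = Math.min(
    Math.round(opts.fadeSeconds * opts.sampleRate),
    Math.floor(out.length / 2),
  );
  for (let i = 0; i < fade; i++) {
    const gain = i / fade;
    out[i] *= gain;
    out[out.length - 1 - i] *= gain;
  }

  // 3. Scale so the loudest sample lands on the target peak.
  let peak = 0;
  for (let i = 0; i < out.length; i++) peak = Math.max(peak, Math.abs(out[i]));
  if (peak > 0) {
    const scale = opts.targetPeak / peak;
    for (let i = 0; i < out.length; i++) out[i] *= scale;
  }
  return out;
}
```

Peak scaling is the simplest form of normalization; a perceptual loudness measure would be a different technique and is not shown here.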

Our listening items now sound like clips rather than recordings, which is honestly closer to user expectations than “real” recordings would have been. Real recordings would carry ambient room tone and microphone characteristics that vary per language. Clean TTS, normalized, is more consistent.

Voice selection

We picked one voice per language. We considered using multiple speakers per language for variety, and we might yet, but it added a calibration problem: voices have different pacing, and a CEFR-B2 utterance at one speaker’s natural rate isn’t the same listening difficulty as the same words at another’s. Holding the voice constant per language lets the speed and pacing be deliberate per CEFR level.

Where the provider offered a choice, we picked voices that were:

  • Neutral in register. Not too formal, not too casual.
  • Mid-range pitch. Lower and higher pitches both compress worse and read less clearly at low playback volumes.
  • Recognizably native to the target locale. Not a generic “world Spanish” voice for es.

What we’d do differently

Two things we wish we’d decided earlier:

  • Standardize the prompt format we send to TTS. We were passing some clips as plain text and others with markup inconsistently, and it produced subtle quality differences across languages. We now pass every clip through a single normalization step before TTS.
  • Capture audio metadata at generation time. Voice name, model version, render settings, all in a sidecar JSON file. When we regenerate a clip six months later, we want to know exactly what we asked for the first time.
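A sidecar file like the one described in the last bullet might look roughly like this. The field names, the `.meta.json` suffix, and both helper functions are hypothetical; the point is only that everything needed to reproduce a render sits next to the audio file.

```typescript
// Hypothetical sidecar shape; field names are illustrative only.
interface ClipSidecar {
  itemId: string;
  language: string;
  voiceName: string;
  modelVersion: string;
  renderSettings: Record<string, number | string>;
  normalizedText: string; // the exact text sent to TTS after normalization
  generatedAt: string;    // ISO 8601 timestamp
}

// Derive the sidecar path from the audio path: swap the extension for ".meta.json".
function sidecarPath(audioPath: string): string {
  return audioPath.replace(/\.[^./]+$/, "") + ".meta.json";
}

// Pretty-print so diffs between regenerations stay readable in version control.
function serializeSidecar(meta: ClipSidecar): string {
  return JSON.stringify(meta, null, 2) + "\n";
}
```

Storing the normalized text alongside the voice and model version means a regeneration can be compared field by field against the original request.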

The clips are in production now. If you ever notice one that sounds wrong, the Report Issue button on the question screen sends us the item ID. That’s the fastest way for us to regenerate just that clip.

v2: word reorders, minimal pairs, listening — and 46 languages

Posted by The Language Level Check team

Today we’re shipping v2 of Language Level Check. It’s the biggest release we’ve cut since launch, and the one we think actually makes the test feel proficient at testing proficiency.

Three things shipped together. An expanded blueprint format with three new question types. An increase to 46 languages. A refreshed UI built around a deeper navy that holds up better in dark mode and in the App Store screenshots.

New question types

The original test had two production formats: multiple choice (for cloze and reading) and constrained typed responses. v2 adds three more.

  • Word reorder. A jumbled set of word tiles the user arranges into a correct sentence. Useful for testing syntax in ways MC questions can’t.
  • Minimal pair. Two sentences that differ by one word. The user picks which is grammatical or which a native speaker would actually say. Excellent for collocation and morphology.
  • Listening with transcript fallback. The original listening items only had audio. v2 attaches a transcript to every listening item that can be revealed if the audio fails or the user wants to read instead.

The new types share scoring code with the existing types. Same weighted CTT scoring engine, same per-skill subscore math, same cut-score lookup. The blueprint schema got an additional production format field but is otherwise compatible with v1 readers (we kept the v1 path alive while we migrated).
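For readers unfamiliar with weighted classical-test-theory scoring, here is a minimal sketch of the three pieces named above. The skill names, the response shape, and the cut-score values used in any example are assumptions for illustration, not the shipped configuration.

```typescript
// Hypothetical skill set; the real subscore configuration lives in the blueprint.
type Skill = "grammar" | "vocabulary" | "reading" | "listening";

interface ScoredResponse {
  skill: Skill;
  weight: number;   // item weight from the blueprint
  correct: boolean;
}

interface CutScore {
  level: string;       // CEFR label, e.g. "B1"
  minFraction: number; // lowest weighted fraction that earns this level
}

// Weighted score: earned weight divided by possible weight.
function weightedFraction(responses: ScoredResponse[]): number {
  let earned = 0;
  let possible = 0;
  for (const r of responses) {
    possible += r.weight;
    if (r.correct) earned += r.weight;
  }
  return possible === 0 ? 0 : earned / possible;
}

// Per-skill subscores use the same math on each skill's subset of responses.
function subscoreBySkill(responses: ScoredResponse[]): Partial<Record<Skill, number>> {
  const skills: Skill[] = ["grammar", "vocabulary", "reading", "listening"];
  const bySkill: Partial<Record<Skill, number>> = {};
  for (const skill of skills) {
    const subset = responses.filter((r) => r.skill === skill);
    if (subset.length > 0) bySkill[skill] = weightedFraction(subset);
  }
  return bySkill;
}

// Cut-score lookup: the highest level whose threshold the fraction reaches.
function lookupLevel(fraction: number, cutScores: CutScore[]): string {
  if (cutScores.length === 0) throw new Error("cutScores must not be empty");
  const sorted = cutScores.slice().sort((a, b) => a.minFraction - b.minFraction);
  let level = sorted[0].level;
  for (const cut of sorted) {
    if (fraction >= cut.minFraction) level = cut.level;
  }
  return level;
}
```

Nothing in this math depends on how the answer was produced, which is why a new question type only needs to report correct or incorrect to reuse it.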

46 languages

The previous build supported 37. v2 adds nine: Cantonese, Bengali, Persian, Swahili, Urdu, Norwegian Bokmål, Catalan, Slovak, and Serbian.

Cantonese is the one we’re most nervous about. Cantonese and Mandarin share a writing system but have different grammar and very different spoken forms, and our scoring framework is CEFR with a custom mapping rather than HSK (which is Mandarin-specific). We expect cut scores to drift more than for our other languages while we accumulate data.

A real blueprint schema

v1 blueprints were essentially “an array of items with some metadata.” v2 formalizes them:

TestBlueprint
├── metadata (language, framework, version, item count, question types)
├── readingPassages (keyed by ID, so passages can be shared across items)
├── scoringModel (CEFR weights, cut scores, subscores configuration)
└── items (array of TestItem)
    └── productionSpec (per-item production format + accept list)

The reading-passage indirection is the one we’re happiest about. v1 redundantly stored the passage on every comprehension item. v2 lets multiple items reference one passage by ID, so a single 200-word passage can naturally support 3-4 comprehension items.
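Expressed as types, the tree above might look something like this. Only the top-level names (`TestBlueprint`, `metadata`, `readingPassages`, `scoringModel`, `items`, `TestItem`, `productionSpec`) come from the schema outline; every nested field name, the `passageId` reference, and the two helpers are assumptions made for the sketch.

```typescript
interface ReadingPassage {
  id: string;
  text: string;
}

interface ProductionSpec {
  format: string;       // per-item production format (values not specified here)
  acceptList: string[]; // strings the engine counts as correct
}

interface TestItem {
  id: string;
  prompt: string;
  passageId?: string;   // hypothetical field: reference into readingPassages
  productionSpec: ProductionSpec;
}

interface TestBlueprint {
  metadata: {
    language: string;
    framework: string;
    version: string;
    itemCount: number;
    questionTypes: string[];
  };
  readingPassages: Record<string, ReadingPassage>;
  scoringModel: {
    weights: Record<string, number>;
    cutScores: Record<string, number>;
    subscores: string[];
  };
  items: TestItem[];
}

// Resolve an item's passage by ID; items without a passage return undefined.
function passageFor(blueprint: TestBlueprint, item: TestItem): ReadingPassage | undefined {
  return item.passageId === undefined ? undefined : blueprint.readingPassages[item.passageId];
}

// IDs of items that reference a passage the blueprint does not contain.
function danglingPassageRefs(blueprint: TestBlueprint): string[] {
  return blueprint.items
    .filter(
      (item) => item.passageId !== undefined && !(item.passageId in blueprint.readingPassages),
    )
    .map((item) => item.id);
}
```

The cost of the indirection is that a reference can now dangle, so a check like `danglingPassageRefs` is worth running wherever blueprints are validated.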

The productionSpec.acceptList is the part that took the longest. The accept list is the set of strings the engine will count as correct, including reasonable alternatives, alternate spellings, and dialect variants. Generating a good accept list by hand is hours of work per item. We’re not happy with it, but we haven’t found a better answer yet.
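As a rough illustration of how an accept list can be checked, here is a minimal sketch. The normalization rules (Unicode NFC, whitespace collapsing, lowercasing) are assumptions; lowercasing in particular may be too lenient for languages where capitalization is part of correct spelling.

```typescript
// Assumed normalization: NFC, trimmed, single spaces, lowercase.
function normalizeAnswer(text: string): string {
  return text.normalize("NFC").trim().replace(/\s+/g, " ").toLowerCase();
}

// An answer is correct if it matches any accept-list entry after normalization.
function isAccepted(answer: string, acceptList: string[]): boolean {
  const wanted = normalizeAnswer(answer);
  return acceptList.some((candidate) => normalizeAnswer(candidate) === wanted);
}
```

Normalization only absorbs formatting noise. Alternate spellings and dialect variants still have to be listed explicitly, which is where the hours per item go.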

UI refresh

We pulled the test UI through a redesign over the same release window. The new palette (deeper navy as primary, softer cream as background) reads better in screenshots, holds up at small sizes on the language picker, and tested better in dark mode. Aesthetically it’s closer to a textbook than a quiz, which we think is the right cue.

The bigger change is in the question runner. The progress indicator is now anchored to the bottom of the screen rather than the top. “I don’t know” is now an explicit button rather than buried in the option list. The question-type badge is larger and color-coded. Small things individually, but together they noticeably reduced confusion in our usability sessions.

What’s next

Most of our focus for the next release is calibration. We have enough sessions now to start checking whether our cut scores match real-world performance. We also want to add a language request feature so users can tell us which language they want next.

Both in flight. v2 is what shipped today.