Developer log

From US-only to localized: routing visitors to the right App Store

Published by The Language Level Check team

A small bug today, with what we think will be a meaningfully large impact.

We were looking through analytics this week and noticed something odd. We were getting consistent web traffic from a handful of non-US countries, but app installs from those same countries were far lower than that traffic suggested. The web-to-install ratio for several specific countries was dramatically worse than our overall average.

We’d assumed the gap was because users in those regions were unfamiliar with the brand and bouncing. The actual cause was less interesting and more fixable.

The bug

Every Apple App Store badge on our marketing site, on the per-language landing pages, in the footer, and in the header nav was a link to https://apps.apple.com/app/id6755918623. That URL doesn’t specify a country. On a phone, that doesn’t matter; Apple’s App Store app opens to your local storefront regardless of what the URL says. But on a desktop browser, that URL redirects to the US App Store. Every time.

So when a non-US desktop visitor clicked our localized download button, Apple sent them to the United States App Store, which showed our app listing in English with US dollar pricing. Conversion at that point is what you’d guess.

The badge image was localized. We’d been swapping the App Store badge SVG to the correct locale variant on the corresponding page. The href wasn’t. The visible language of the button matched the user’s locale; the destination was American.

A meaningful share of our web traffic is desktop, and a meaningful share is non-US. The math says this has been quietly costing us installs since launch.

The fix

Two changes today.

First, every store link on the site now includes a locale-specific country code. The Apple URL is now https://apps.apple.com/{country}/app/id6755918623. For Japanese visitors, /jp/. For German, /de/. For Chinese (Simplified), /cn/. And so on across all 33 locales we support. Google Play links similarly include the hl display-language parameter and the gl storefront-country parameter, so a Spanish visitor lands on the Spanish version of the Play Store page with Spanish pricing.

The country codes aren’t always the obvious match to the language code. Some examples where they differ:

  • Czech (cs locale) → Apple cz storefront
  • Greek (el locale) → Apple gr storefront
  • Ukrainian (uk locale) → Apple ua storefront
  • Norwegian Bokmål (nb locale) → Apple no storefront
  • Catalan (ca locale) → Apple es storefront (no separate Catalan store)
  • Arabic (ar locale) → Apple sa storefront (Saudi Arabia, the largest Arabic-speaking App Store)

These are storefront identifiers, not language identifiers. They’re the result of one team member spending an evening cross-checking ISO 3166 codes against Apple’s actual storefront URLs.
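As a rough illustration of the link-building described above, here is a minimal Python sketch. It is not our site code. The locale keys "ja", "de", and "zh-Hans", the "us" fallback, the PLAY_PACKAGE placeholder, and the choice to reuse the Apple storefront country as the Play gl value are all assumptions made for the example; only the Apple URL pattern, the app ID, and the locale-to-storefront pairs come from the text.

```python
from urllib.parse import urlencode

APPLE_APP_ID = "id6755918623"      # from the post
PLAY_PACKAGE = "com.example.app"   # hypothetical placeholder, real package name not given

# Site locale -> storefront country. Only the pairs named in the post;
# the keys "ja", "de" and "zh-Hans" are assumed locale codes.
STOREFRONT = {
    "ja": "jp", "de": "de", "zh-Hans": "cn",
    "cs": "cz", "el": "gr", "uk": "ua",
    "nb": "no", "ca": "es", "ar": "sa",
}

DEFAULT_COUNTRY = "us"  # assumed fallback for locales missing from the table


def apple_url(locale: str) -> str:
    """Build an App Store link that names the storefront explicitly."""
    country = STOREFRONT.get(locale, DEFAULT_COUNTRY)
    return f"https://apps.apple.com/{country}/app/{APPLE_APP_ID}"


def play_url(locale: str) -> str:
    """Build a Google Play link with hl (display language) and gl (country).

    Assumption: the site locale is passed through as hl unchanged; some
    locales may need a different language tag on the Play side.
    """
    country = STOREFRONT.get(locale, DEFAULT_COUNTRY)
    params = {"id": PLAY_PACKAGE, "hl": locale, "gl": country.upper()}
    return "https://play.google.com/store/apps/details?" + urlencode(params)


print(apple_url("cs"))  # https://apps.apple.com/cz/app/id6755918623
print(play_url("cs"))
```

Keeping the mapping in one table means the badge image and the href can both be derived from the same locale value, which is the mismatch that caused the original bug.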

Second, the blue “Download App” button in the header now points to a new /download page instead of going straight to the App Store. The /download page shows the same call-to-action block as the homepage: both the Apple and the Google Play badges, presented as an explicit choice rather than a guess based on the user’s device. Visitors who came to the site on a desktop and want the Android version can finally get to it from the header.

What we expect

We expect a measurable lift in download conversion from non-US desktop traffic. The exact size is hard to predict; we haven’t had this instrumented before.

We added a locale property to our outbound store-click events. Going forward, we can see locale-by-locale whether click-through behavior shifts on the new build. We’ll probably write a follow-up post once we’ve had a few weeks to compare.

What we’re kicking ourselves about

This bug was visible in our raw analytics for months. The pattern (high traffic from a country, low installs from that country) was obvious in retrospect. We assumed the cause was something else without checking. The check, in the end, was a single curl command against the unparameterized App Store URL. Three minutes of work. We should have done it in week one.

General lesson: when a metric looks bad and you have a quick explanation in your head, write the explanation down, then go test it. We didn’t, for too long.

Anyway. If you’re reading this from outside the United States and you’ve ever clicked one of our App Store buttons on a desktop and ended up on a US listing, sorry about that. The button now goes where it says it goes.

What we read in the analytics: 90 days of patterns and what we're acting on

Published by The Language Level Check team

We just sat down and read through 90 days of analytics. Some of what we found was expected. Some wasn’t. A few things we’re going to act on this week. This post is the readout, minus the specific numbers we’d rather keep internal.

Where we’re healthy

The in-app funnel is the part we feel best about. Users who start a test mostly finish it. Users who finish are more likely than we’d guessed to share the result with someone. The 10-minute test length tracks well against typical session duration, which suggests we’ve calibrated the time investment about right.

The level distributions support the same picture. The languages where we have the most session volume show consistent CEFR-level distributions across users, with the bulk of results landing in the A2-B2 range. That’s where we’d expect a self-selected audience to cluster, and it suggests the scoring isn’t quietly biased toward any one level.

Where conversion is leaking

A few patterns we want to fix:

  • Bounce rate on language landing pages is too high. Visitors arrive via SEO, look briefly, and leave without engaging with the CTA. We think we’re missing two things above the fold: a clearer one-sentence “what this app actually is” line, and social proof of some kind. The fix is mostly a hero redesign, starting with our highest-traffic language page.
  • Trailing-slash URL duplicates split our SEO juice. Google has indexed both /languages/turkish and /languages/turkish/ for almost every language, and visitors are roughly evenly split between the two. Canonical URLs are no-slash (we’ve had trailing slashes disabled at the build level for a while), but the slashed versions are out there ranking. We’re adding server-side 301 redirects to consolidate them.
  • Localized URLs underperform their English equivalents for native-language traffic. A native speaker searching in their own language often still lands on the English version of the page rather than the localized one. The fix is probably stronger hreflang signals and possibly explicit redirects on matching Accept-Language.
  • The App Store routing bug we fixed today. Covered in the previous post. Desktop visitors outside the US were silently redirected to the US App Store regardless of which localized button they clicked.
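The trailing-slash consolidation above comes down to one small rule. Here is a hedged Python sketch of it. The function name is hypothetical, it handles the path only (query strings and the actual server or CDN configuration are out of scope), and it is not our production redirect config.

```python
from typing import Optional


def trailing_slash_redirect(path: str) -> Optional[str]:
    """Return the canonical no-slash path to 301 to, or None if none is needed.

    The site root "/" is left alone, since it has no slashless form.
    """
    if path != "/" and path.endswith("/"):
        # rstrip handles accidental doubled slashes such as "/languages/turkish//"
        return path.rstrip("/") or "/"
    return None


print(trailing_slash_redirect("/languages/turkish/"))  # /languages/turkish
print(trailing_slash_redirect("/languages/turkish"))   # None
```

A server would answer with status 301 and a Location header set to the returned path whenever the result is not None.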

Surprises

A few patterns we didn’t expect:

  • AI chat assistants are a meaningful referrer. Not just incidental traffic. A real channel, and a growing one. We’re going to write separately about what that implies for how we structure pages and product descriptions, because the optimization story for AI-driven referrals is different from the SEO one.
  • Some smaller markets convert disproportionately well. Tiny sample sizes individually, but the pattern repeats across enough of them to suggest the appetite for language-testing tools is geographically uneven in ways we haven’t been targeting.
  • The share-from-results action gets used more than we expected. People want to send their CEFR estimate to someone (a tutor, a friend, a teacher). We treated the share button as table stakes when we built it. The data says it should be a first-class feature.

Internal-tool tweaks worth mentioning

A few smaller things that came out of the same review:

  • Our in-app rating prompt is currently too conservative. We tightened the logic months back after early complaints about it firing too often. We tightened it too hard. Loosening it is a one-line change for the next release.
  • The in-app affiliate card design is too easy to skip. Click-through is low enough that the cards aren’t earning the screen space they take. Redesign rather than removal.
  • Our analytics events for outbound store clicks didn’t include the visitor’s locale until today’s release. Going forward, every store-click event carries the locale, so we can measure the lift from the localized App Store routing fix country by country.

What we’re acting on this week

In rough priority order:

  1. Watch the localized App Store routing fix land. Measure the lift on outbound store clicks once we have a few weeks of data on the new build.
  2. Add 301 redirects for trailing-slash URLs to consolidate canonical ranking.
  3. Loosen the rating prompt threshold so it shows after a reasonable cadence rather than almost never.
  4. Above-the-fold redesign on the top language landing pages, starting with the highest-traffic one. Clearer one-line value prop, visible social proof, more prominent CTA.
  5. Refresh the in-app affiliate card design to make it less invisible.

We’ll write up the results of each as we ship.

Word-reorder scoring for languages without spaces

Published by The Language Level Check team

A word-reorder question is one of our newer production formats. The user gets a set of word tiles in a scrambled order and arranges them into a correct sentence. We score by comparing the user’s ordered list of tiles to the accept-list of valid orderings authored in the blueprint.

For languages that put spaces between words, this works exactly as you’d expect. For Thai, Japanese, and Mandarin (which don’t), it’s been quietly broken since we added the format. This week we shipped the fix, and on the way we learned something we should have known earlier about tokenization.

The bug

The blueprint for a word-reorder item stores the accept list as a sequence of tokens. For an English item:

acceptList: ["I", "want", "to", "go", "home"]

The matcher compared the user’s tile order to this sequence. Tile boundaries on screen matched token boundaries in the blueprint. Simple.

For Thai, the same item type stored tokens like:

acceptList: ["ฉัน", "อยาก", "กลับ", "บ้าน"]

Tiles on screen showed those four tokens. User arranged them. Matcher compared. That part also worked. The bug was elsewhere.

The bug was that when we generated a blueprint, we ran the source sentence through a tokenizer to split it into tiles. The tokenizer we were using was language-neutral and assumed whitespace as a token boundary. For English, “I want to go home” became ["I", "want", "to", "go", "home"]. For Thai, “ฉันอยากกลับบ้าน” (no spaces in the source) became ["ฉันอยากกลับบ้าน"]. A single tile. A single-tile word-reorder question is, obviously, not a word-reorder question.

We caught this for items where authors had manually inserted spaces in the Thai source. But the items where the source was natural Thai (without spaces) silently degraded to “drag this one tile onto the answer line,” which scored as a 100% correct word-reorder every time.
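The failure is easy to reproduce. The Python sketch below is an illustration rather than our blueprint generator: the function names are hypothetical, and the accept list is modelled as a list of valid orderings, which is an assumption about the blueprint shape.

```python
def whitespace_tiles(sentence: str) -> list:
    """Language-neutral tokenizer that treats whitespace as the only boundary."""
    return sentence.split()


def score_reorder(user_order: list, accept_list: list) -> bool:
    """Correct if the user's tile order equals any accepted ordering."""
    return any(user_order == ordering for ordering in accept_list)


english = whitespace_tiles("I want to go home")
thai = whitespace_tiles("ฉันอยากกลับบ้าน")

print(english)  # ['I', 'want', 'to', 'go', 'home']
print(thai)     # ['ฉันอยากกลับบ้าน']

# With a single tile there is only one possible ordering, so every
# attempt matches the accept list and scores as correct.
print(score_reorder(thai, [thai]))  # True
```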

Why the blueprint-coverage suite didn’t catch it

We have a blueprint-coverage test suite on both iOS and Android that drives every blueprint × every item × every accept-list entry through the production scoring code. The suite confirms the production code accepts what the blueprint says is correct.

For these broken Thai items, the blueprint said “the correct ordering is [single-tile-with-the-whole-sentence],” and the scoring engine correctly accepted that ordering. The suite was green. The suite checks blueprint-vs-engine consistency, not blueprint-vs-pedagogical-intent. Both can be wrong.

This is a real limit of automated testing for content. A schema-valid blueprint with bad pedagogical content is still schema-valid. We had a human review pass that should have caught these, but the Thai items had been reviewed under the assumption that the tile layout shown in the QA preview was the layout the user would see, and the preview tool was using the same broken tokenizer. The bug was self-consistent across the whole pipeline.

The fix

Three changes:

  1. Language-aware tokenizers. For Thai we now use a dictionary-based segmenter. For Japanese a morphological analyzer. For Mandarin a statistical word segmenter. These know how to split a sentence into morphologically reasonable tiles even when there are no spaces.

  2. A per-language minimum tile count check in the blueprint validator. If a word-reorder item generates fewer than three tiles after segmentation, it fails validation. This would have caught the bug at the blueprint level, regardless of what the runtime did.

  3. Re-author the affected Thai, Japanese, and Mandarin word-reorder items using the new tokenizer. About 40 items total across the three languages. Each now produces a sensible number of tiles, and the accept list has been regenerated against the new segmentation.
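To make the first two changes concrete, here is a toy Python sketch. It uses greedy longest-match over a four-word dictionary as a stand-in for a dictionary-based segmenter. That is a simplification chosen for the example, not the segmenter we ship, and real segmenters need far larger dictionaries and better handling of ambiguity. The function names and the single MIN_TILES constant are hypothetical; the threshold of three comes from the text.

```python
MIN_TILES = 3  # items with fewer than three tiles fail validation, per the post


def segment_longest_match(sentence: str, dictionary: set) -> list:
    """Greedy longest-match segmentation against a word dictionary.

    Characters that start no dictionary word are emitted as single-character
    tiles so the function always makes progress.
    """
    max_len = max((len(word) for word in dictionary), default=1)
    tiles = []
    i = 0
    while i < len(sentence):
        tile = sentence[i]  # fallback: one unknown character
        for size in range(min(max_len, len(sentence) - i), 0, -1):
            candidate = sentence[i:i + size]
            if candidate in dictionary:
                tile = candidate
                break
        tiles.append(tile)
        i += len(tile)
    return tiles


def has_enough_tiles(tiles: list) -> bool:
    """Validator rule: a word-reorder item needs at least MIN_TILES tiles."""
    return len(tiles) >= MIN_TILES


toy_dictionary = {"ฉัน", "อยาก", "กลับ", "บ้าน"}
tiles = segment_longest_match("ฉันอยากกลับบ้าน", toy_dictionary)
print(tiles)                                  # ['ฉัน', 'อยาก', 'กลับ', 'บ้าน']
print(has_enough_tiles(tiles))                # True
print(has_enough_tiles(["ฉันอยากกลับบ้าน"]))  # False
```

The validator check is independent of the segmenter, so it still fails the item if a future tokenizer change regresses to a single tile.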

Shipped in Android v1.1.0 today. The iOS counterpart is in the next iOS release window.

What we’re taking away

A few lessons that will outlive this specific fix:

  • Tokenization isn’t language-neutral, even when your code looks like it doesn’t care. Anywhere we assume whitespace is a word boundary is a place that breaks on Thai, Japanese, Mandarin, Khmer, Lao, and a few others.
  • Coverage tests catch consistency, not correctness. A blueprint and a scoring engine that agree on a wrong answer are still wrong. We need at least one human reviewer per language who’s reading the blueprint with the eyes of a learner, not just running the validator.
  • The QA preview tool shouldn’t share code with the production runtime. When both use the same broken assumption, your reviewer can’t see the bug. We’ve split them. The preview tool now renders tiles using a deliberately different path so it would catch a divergence.

Sorry to anyone whose Thai, Japanese, or Mandarin word-reorder results were trivially correct over the last several weeks. The good news is that those items didn’t weight heavily in your overall CEFR estimate (word-reorder is one production format among several). The fix is live now.

Spring releases: rating prompts and smarter answer matching

Published by The Language Level Check team

This week we cut two coordinated releases: iOS 2.1.4 and Android 1.0.1. Neither is a flashy headline release. Both fix real things that needed fixing, and one of them solves a problem we’ve been chipping at since v2.

Here’s what shipped, in order of how interesting it is.

Smarter production answer matching

The most engineering-heavy change in this release. We rewrote how the app decides whether a typed production answer counts as correct.

The original matcher was a string-comparison waterfall:

  1. Exact match against any string in the item’s acceptList.
  2. Case-folded exact match.
  3. Trimmed-and-normalized exact match (collapse whitespace, strip punctuation).

That handled the obvious cases (“Hello” matches “hello”), and it failed the slightly-less-obvious ones (“colour” not matching “color”, “they’re” not matching “they are”, typed accents not matching unaccented entries). The accept list grew over time to paper over these. But the accept list has a long tail: there are always more valid variants than anyone can enumerate by hand.
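For readers who want the three steps spelled out, here is a minimal Python sketch of such a waterfall. It is not the app's matcher. It assumes each step builds on the previous one (so the normalized step also case-folds), and it strips ASCII punctuation only. The function names are hypothetical.

```python
import string
from typing import Optional


def normalize(text: str) -> str:
    """Strip ASCII punctuation, then trim and collapse whitespace."""
    stripped = "".join(ch for ch in text if ch not in string.punctuation)
    return " ".join(stripped.split())


def match_waterfall(typed: str, accept_list: list) -> Optional[str]:
    """Return the name of the first step that matches, or None."""
    if typed in accept_list:
        return "exact"
    folded = typed.casefold()
    if any(folded == entry.casefold() for entry in accept_list):
        return "case-folded"
    normalized = normalize(typed).casefold()
    if any(normalized == normalize(entry).casefold() for entry in accept_list):
        return "normalized"
    return None


print(match_waterfall("Hello", ["hello"]))                    # case-folded
print(match_waterfall("  hello,  world! ", ["Hello world"]))  # normalized
print(match_waterfall("colour", ["color"]))                   # None
```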

The new matcher adds two layers:

  1. Diacritic-insensitive normalization. “café” and “cafe” are equivalent for matching purposes. We apply this only when the item’s productionSpec.scoringNotes say accents aren’t load-bearing (which is most items in most languages).
  2. Semantic-similarity fallback using an on-device embedding model on each platform. If the typed answer isn’t an exact or normalized match, we compute the cosine similarity between the typed answer’s embedding and each accept-list entry’s embedding. If the similarity is high enough, we count the answer correct.

The semantic fallback is the part we’re watching closely. We didn’t turn it on for every item, only for items flagged as semantic_eligible in their productionSpec. Reason: a semantic match is loose. For low-CEFR items where the test is checking a specific phrase, an embedding-based match is the wrong tool. For higher-CEFR items where the test is checking comprehension expressed in any reasonable wording, it’s the right one.
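The two new layers can be sketched in Python as follows. Everything here is illustrative: the embed callable stands in for whatever on-device model each platform uses, the 0.85 threshold is a made-up value (the real cutoff is not stated in this post), and the function names are hypothetical.

```python
import math
import unicodedata
from typing import Callable, List, Sequence


def strip_diacritics(text: str) -> str:
    """Decompose to NFD and drop combining marks, so 'café' becomes 'cafe'.

    This is destructive in scripts where combining marks carry meaning,
    which is why it should only run on items where accents are not load-bearing.
    """
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def semantic_match(
    typed: str,
    accept_list: List[str],
    embed: Callable[[str], Sequence[float]],  # hypothetical embedding function
    semantic_eligible: bool,
    threshold: float = 0.85,                  # assumed value, for illustration only
) -> bool:
    """Fallback match: only for eligible items, and only above the threshold."""
    if not semantic_eligible:
        return False
    typed_vec = embed(typed)
    return any(
        cosine_similarity(typed_vec, embed(entry)) >= threshold
        for entry in accept_list
    )


print(strip_diacritics("café"))  # cafe
```

Passing the embedding function in as a parameter keeps the gating and threshold logic testable without loading a model.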

Initial data is encouraging across the languages we’ve re-measured. We’re watching for false positives, answers the matcher now accepts that it shouldn’t. A small number have come in via the Report Issue button so far. All fixed.

Less annoying rating prompt

Apple lets you ask users to rate your app a limited number of times per year. The default place to ask is “after a positive interaction,” which we’d been interpreting as “after every test completion.” Which meant users who took multiple tests in a short window got asked twice, and users who took a single test got asked at the precise moment they wanted to read their result. Neither of those is what Apple means by a positive interaction.

The new logic asks at most once per session, only after a test that produced a level estimate (not after an early exit), and only after the user has dismissed the results screen at least once before. So we’re catching them on a return to results, not a first-look. Prompt cadence and follow-through both moved in the directions we wanted.
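Written as code, the three conditions look roughly like this. It is a Python sketch with hypothetical field names, not the Swift or Kotlin that ships, and it leaves out the platform's own per-year limits.

```python
from dataclasses import dataclass


@dataclass
class SessionState:
    prompted_this_session: bool         # has the prompt already been shown this session?
    test_produced_level_estimate: bool  # False for an early exit
    prior_results_dismissals: int       # times the results screen was dismissed before now


def should_show_rating_prompt(state: SessionState) -> bool:
    """All three conditions from the text must hold."""
    return (
        not state.prompted_this_session
        and state.test_produced_level_estimate
        and state.prior_results_dismissals >= 1
    )


first_look = SessionState(False, True, 0)
return_visit = SessionState(False, True, 1)
print(should_show_rating_prompt(first_look))    # False
print(should_show_rating_prompt(return_visit))  # True
```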

We’re still showing the prompt less often than the per-year limit would allow. We might push it up a notch in the next release.

Orphan analytics events

Less interesting but worth a mention. We found a class of analytics events being sent from sessions that crashed before they completed, producing dangling “Test Started” events with no matching “Test Completed” or “Test Early Exit.” The fix flushes a session boundary marker at the right point so our funnel queries aren’t silently undercounting completions.
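One way to spot the dangling events is a simple set difference over session IDs. The Python sketch below assumes each event is a dict with name and session_id keys, which is a hypothetical shape for illustration. The three event names are the ones quoted above.

```python
TERMINAL_EVENTS = {"Test Completed", "Test Early Exit"}


def orphan_sessions(events: list) -> set:
    """Return session IDs that have a 'Test Started' but no terminal event."""
    started = set()
    finished = set()
    for event in events:
        if event["name"] == "Test Started":
            started.add(event["session_id"])
        elif event["name"] in TERMINAL_EVENTS:
            finished.add(event["session_id"])
    return started - finished


sample = [
    {"name": "Test Started", "session_id": "s1"},
    {"name": "Test Completed", "session_id": "s1"},
    {"name": "Test Started", "session_id": "s2"},  # crashed before finishing
    {"name": "Test Started", "session_id": "s3"},
    {"name": "Test Early Exit", "session_id": "s3"},
]
print(orphan_sessions(sample))  # {'s2'}
```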

Most users won’t notice. It’ll make our internal numbers more honest, which over a quarter is enough to matter when we’re making product decisions off them.

Localized release notes

For the first time, the App Store release notes for iOS 2.1.4 went out in every locale we ship listings for. We’d been writing notes in English and hoping. The localized versions go through the same translation pipeline as the rest of the app and are now reviewed alongside the rest of the metadata for each release.

Android v1.0.1 specifics

The Android side of this release is mostly the same. The answer matcher uses the platform-equivalent embedding model. The rating prompt is the same logic adapted to Play’s in-app review API. The analytics fix is identical. We also shipped a stack of layout fixes for foldables and very-tall phones that our automated screenshot pipeline had been hiding from us.

What’s next

One large piece of work nearly done: a word-reorder scoring fix for languages that don’t use spaces between words (Thai, Japanese, Mandarin). Large enough to deserve its own post when it ships.