Spring releases: rating prompts and smarter answer matching

This week we cut two coordinated releases: iOS 2.1.4 and Android 1.0.1. Neither is a flashy headline release. Both fix real things that needed fixing, and one of them solves a problem we’ve been chipping at since v2.

Here’s what shipped, in order of how interesting it is.

Smarter production answer matching

This is the most engineering-heavy change in the release: we rewrote how the app decides whether a typed production answer counts as correct.

The original matcher was a string-comparison waterfall:

1. Exact match against any string in the item’s acceptList.
2. Case-folded exact match.
3. Trimmed-and-normalized exact match (collapse whitespace, strip punctuation).

That handled the obvious cases (“Hello” matches “hello”), but it failed the slightly less obvious ones: “colour” not matching “color”, “they’re” not matching “they are”, typed accents not matching unaccented entries. The accept list grew over time to paper over these, but the set of reasonable answers has a long tail, and a hand-maintained list never reaches the end of it.
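For readers who want to see the shape of that waterfall, here is a minimal Python sketch. It is not the shipped code (the apps are native), and the function names, the ASCII-only punctuation set, and the choice to case-fold in the last step are all assumptions made for illustration.

```python
import re
import string


def normalize(text: str) -> str:
    """Strip ASCII punctuation, collapse whitespace, trim, and case-fold.

    Assumption: the real normalizer may treat typographic punctuation
    (curly quotes, for example) differently; this sketch only knows ASCII.
    """
    no_punct = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", no_punct).strip().casefold()


def legacy_match(typed: str, accept_list: list[str]) -> bool:
    """Hypothetical reconstruction of the three-step string waterfall."""
    # Step 1: exact match against any accept-list entry.
    if typed in accept_list:
        return True
    # Step 2: case-folded exact match.
    folded = typed.casefold()
    if any(folded == entry.casefold() for entry in accept_list):
        return True
    # Step 3: trimmed-and-normalized exact match.
    normalized = normalize(typed)
    return any(normalized == normalize(entry) for entry in accept_list)
```

Under these assumptions, "Hello" matches an accept list of ["hello"], while "colour" against ["color"], "they're" against ["they are"], and "café" against ["cafe"] all fall through every step, which is the gap described above.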

The new matcher adds two layers:

1. Diacritic-insensitive normalization. “café” and “cafe” are equivalent for matching purposes. We apply this only when the item’s productionSpec.scoringNotes say accents aren’t load-bearing (which is most items in most languages).
2. Semantic-similarity fallback using an on-device embedding model on each platform. If the typed answer isn’t an exact or normalized match, we compute the cosine similarity between the typed answer’s embedding and each accept-list entry’s embedding. If the similarity is high enough, we count the answer correct.
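The two building blocks here are standard, so a rough Python sketch may help. It shows one common way to fold diacritics (Unicode decomposition, then dropping combining marks) and the textbook cosine-similarity formula. The function names and the accents_matter flag are hypothetical stand-ins, and the on-device embedding model itself is not shown.

```python
import math
import unicodedata


def strip_diacritics(text: str) -> str:
    """Decompose to NFD, drop combining marks, recompose to NFC.

    Letters with no Unicode decomposition (such as "ø") pass through unchanged.
    """
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", kept)


def answers_equal(typed: str, entry: str, accents_matter: bool) -> bool:
    """Compare case-insensitively; fold accents only when they are not load-bearing.

    `accents_matter` is a hypothetical flag standing in for whatever the
    item's scoring notes say.
    """
    if accents_matter:
        return typed.casefold() == entry.casefold()
    return strip_diacritics(typed).casefold() == strip_diacritics(entry).casefold()


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors; 0.0 if either is zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)
```

With this sketch, answers_equal("café", "cafe", accents_matter=False) holds, while the same pair fails when accents_matter is True.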

The semantic fallback is the part we’re watching closely. We didn’t turn it on for every item, only for items flagged as semantic_eligible in their productionSpec. Reason: a semantic match is loose. For low-CEFR items where the test is checking a specific phrase, an embedding-based match is the wrong tool. For higher-CEFR items where the test is checking comprehension expressed in any reasonable wording, it’s the right one.
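As a sketch of that gating, assuming a simplified item spec: the ProductionSpec class below, the injected similarity function, and the 0.85 threshold are all hypothetical. This post does not state the real threshold, and the real spec has more fields than this.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical stand-in for "embed both strings, then take cosine similarity".
Similarity = Callable[[str, str], float]


@dataclass
class ProductionSpec:
    """Simplified, assumed shape of an item's production spec."""
    accept_list: list[str]
    semantic_eligible: bool = False


def is_correct(
    typed: str,
    spec: ProductionSpec,
    similarity: Similarity,
    threshold: float = 0.85,  # illustrative value only
) -> bool:
    # String matching always runs first (reduced here to a case-folded compare).
    typed_key = typed.strip().casefold()
    if any(typed_key == entry.strip().casefold() for entry in spec.accept_list):
        return True
    # The semantic fallback is opt-in per item.
    if not spec.semantic_eligible:
        return False
    return any(similarity(typed, entry) >= threshold for entry in spec.accept_list)
```

The point of the shape is that an item without the flag never reaches the embedding step at all, so a specific-phrase item cannot be passed by a loose paraphrase.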

Initial data is encouraging across the languages we’ve re-measured. We’re watching for false positives: answers the matcher now accepts that it shouldn’t. A small number have come in via the Report Issue button so far, and all of them have been fixed.

Less annoying rating prompt

Apple lets you ask users to rate your app a limited number of times per year. The default place to ask is “after a positive interaction,” which we’d been interpreting as “after every test completion.” That meant users who took multiple tests in a short window got asked twice, and users who took a single test got asked at the precise moment they wanted to read their result. Neither of those is what Apple means by a positive interaction.

The new logic asks at most once per session, only after a test that produced a level estimate (not after an early exit), and only after the user has dismissed the results screen at least once before. So we’re catching them on a return to results, not a first look. Prompt cadence and follow-through both moved in the directions we wanted.
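Those three conditions boil down to a small predicate. Here is a Python sketch of it, with hypothetical field names. Whether the "dismissed results before" count is tracked per install or per session is not specified above, so the sketch simply takes it as an input.

```python
from dataclasses import dataclass


@dataclass
class PromptState:
    """Hypothetical bookkeeping for the rating prompt."""
    prompted_this_session: bool = False
    prior_results_dismissals: int = 0  # scope (install vs. session) is an assumption


def should_request_rating(state: PromptState, produced_level_estimate: bool) -> bool:
    """True only when all three conditions from the paragraph above hold."""
    return (
        not state.prompted_this_session          # at most once per session
        and produced_level_estimate              # not after an early exit
        and state.prior_results_dismissals >= 1  # a return to results, not a first look
    )
```

Even when this returns True, the platform still decides whether the system prompt actually appears, so the predicate is an upper bound on how often users see it.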

We’re still showing the prompt less often than the per-year limit would allow. We might push it up a notch in the next release.

Orphan analytics events

Less interesting but worth a mention. We found a class of analytics events being sent from sessions that crashed before they completed, producing dangling “Test Started” events with no matching “Test Completed” or “Test Early Exit.” The fix flushes a session boundary marker at the right point so our funnel queries aren’t silently undercounting completions.
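To make "dangling" concrete, here is a small Python sketch that finds sessions with a “Test Started” event and no terminal event. The event names come from the paragraph above; the (session_id, event_name) tuple layout is an assumption, and this is an illustration of the symptom, not our funnel query or the fix itself.

```python
from typing import Iterable

TERMINAL_EVENTS = {"Test Completed", "Test Early Exit"}


def orphaned_sessions(events: Iterable[tuple[str, str]]) -> set[str]:
    """Return ids of sessions that started a test but never reported an end.

    Each event is an assumed (session_id, event_name) pair.
    """
    started: set[str] = set()
    finished: set[str] = set()
    for session_id, name in events:
        if name == "Test Started":
            started.add(session_id)
        elif name in TERMINAL_EVENTS:
            finished.add(session_id)
    return started - finished
```

Any funnel that divides completions by starts will be skewed by whatever this returns, which is why the orphans are worth marking explicitly.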

Most users won’t notice. It’ll make our internal numbers more honest, which over a quarter is enough to matter when we’re making product decisions off them.

Localized release notes

For the first time, the App Store release notes for iOS 2.1.4 went out in every locale we ship listings for. We’d been writing notes in English and hoping. The localized versions go through the same translation pipeline as the rest of the app and are now reviewed alongside the rest of the metadata for each release.

Android 1.0.1 specifics

The Android side of this release is mostly the same. The answer matcher uses the platform-equivalent embedding model. The rating prompt is the same logic adapted to Play’s in-app review API. The analytics fix is identical. We also shipped a stack of layout fixes for foldables and very-tall phones that our automated screenshot pipeline had been hiding from us.

What’s next

One large piece of work is nearly done: a word-reorder scoring fix for languages that don’t use spaces between words (Thai, Japanese, Mandarin). It’s large enough to deserve its own post when it ships.