Word-reorder scoring for languages without spaces

A word-reorder question is one of our newer production formats. The user gets a set of word tiles in a scrambled order and arranges them into a correct sentence. We score by comparing the user’s ordered list of tiles against the accept list of valid orderings authored in the blueprint.

For languages that put spaces between words, this works exactly as you’d expect. For Thai, Japanese, and Mandarin (which don’t), it’s been quietly broken since we added the format. This week we shipped the fix, and on the way we learned something we should have known earlier about tokenization.

The bug

The blueprint for a word-reorder item stores the accept list as a sequence of tokens. For an English item:

acceptList: ["I", "want", "to", "go", "home"]

The matcher compared the user’s tile order to this sequence. Tile boundaries on screen matched token boundaries in the blueprint. Simple.

For Thai, the same item type stored tokens like:

acceptList: ["ฉัน", "อยาก", "กลับ", "บ้าน"]

Tiles on screen showed those four tokens. User arranged them. Matcher compared. That part also worked. The bug was elsewhere.
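
For reference, the comparison itself can be sketched roughly as follows. This is a minimal illustration, not the production matcher: the function name `is_accepted` is hypothetical, and it assumes the accept list holds one or more complete orderings, each a list of tile strings.

```python
from typing import Sequence


def is_accepted(user_order: Sequence[str],
                accept_list: Sequence[Sequence[str]]) -> bool:
    """Return True if the user's tile order exactly matches any accepted ordering."""
    user = list(user_order)
    return any(user == list(ordering) for ordering in accept_list)


# Hypothetical accept list with a single valid ordering (the Thai example above).
accept_list = [["ฉัน", "อยาก", "กลับ", "บ้าน"]]

print(is_accepted(["ฉัน", "อยาก", "กลับ", "บ้าน"], accept_list))  # True
print(is_accepted(["อยาก", "ฉัน", "กลับ", "บ้าน"], accept_list))  # False
```

The match is exact and order-sensitive, which is why the matcher is only as good as the tile boundaries it is handed.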

The bug was in blueprint generation, where we ran the source sentence through a tokenizer to split it into tiles. That tokenizer was supposedly language-neutral, but it assumed that whitespace marks every token boundary. For English, “I want to go home” became ["I", "want", "to", "go", "home"]. For Thai, “ฉันอยากกลับบ้าน” (no spaces in the source) became ["ฉันอยากกลับบ้าน"]: a single tile. A single-tile word-reorder question is, obviously, not a word-reorder question.

We caught this for items where authors had manually inserted spaces in the Thai source. But the items where the source was natural Thai (without spaces) silently degraded to “drag this one tile onto the answer line,” which scored as a 100% correct word-reorder every time.
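
The failure is easy to reproduce with any whitespace-based splitter. The snippet below is a stand-in, not our actual tokenizer; `whitespace_tokenize` is a hypothetical name, and Python's `str.split()` plays the role of the whitespace assumption.

```python
from typing import List


def whitespace_tokenize(sentence: str) -> List[str]:
    """Split on runs of whitespace, which is the assumption that broke Thai."""
    return sentence.split()


english = whitespace_tokenize("I want to go home")
thai = whitespace_tokenize("ฉันอยากกลับบ้าน")            # natural Thai, no spaces
thai_spaced = whitespace_tokenize("ฉัน อยาก กลับ บ้าน")  # author inserted spaces

print(len(english), len(thai), len(thai_spaced))  # 5 1 4
```

Nothing raises and nothing logs. The Thai sentence simply comes back as one token, which is how the problem stayed quiet.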

Why the blueprint-coverage suite didn’t catch it

We have a blueprint-coverage test suite on both iOS and Android that drives every blueprint × every item × every accept-list entry through the production scoring code. The suite confirms the production code accepts what the blueprint says is correct.

For these broken Thai items, the blueprint said “the correct ordering is [single-tile-with-the-whole-sentence],” and the scoring engine correctly accepted that ordering. The suite was green. The suite checks blueprint-vs-engine consistency, not blueprint-vs-pedagogical-intent. The blueprint and the engine can agree with each other and still both be wrong.
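
A toy version of that kind of check shows why it stays green. This is a sketch under assumptions: the blueprint shape, the item ids, and the names `is_accepted` and `coverage_failures` are all hypothetical, and the real suite drives the production scoring code rather than a local function.

```python
from typing import Dict, List, Sequence


def is_accepted(user_order: Sequence[str],
                accept_list: Sequence[Sequence[str]]) -> bool:
    """Stand-in for the scoring engine: exact match against any accepted ordering."""
    return any(list(user_order) == list(ordering) for ordering in accept_list)


def coverage_failures(blueprints: List[Dict]) -> List[str]:
    """Feed every accept-list entry to the engine; return ids of items it rejects."""
    failures = []
    for blueprint in blueprints:
        for ordering in blueprint["acceptList"]:
            if not is_accepted(ordering, blueprint["acceptList"]):
                failures.append(blueprint["id"])
    return failures


# Hypothetical items: one healthy, one degraded to a single tile.
healthy = {"id": "en-001", "acceptList": [["I", "want", "to", "go", "home"]]}
degenerate = {"id": "th-001", "acceptList": [["ฉันอยากกลับบ้าน"]]}

print(coverage_failures([healthy, degenerate]))  # []
```

The check has no notion of how many tiles an item ought to have, so the single-tile item passes just as cleanly as the healthy one.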

This is a real limit of automated testing for content. A schema-valid blueprint with bad pedagogical content is still schema-valid. We had a human review pass that should have caught these, but the Thai items had been reviewed under the assumption that the tile layout shown in the QA preview was the layout the user would see, and the preview tool was using the same broken tokenizer. The bug was self-consistent across the whole pipeline.

The fix

Three changes:

Language-aware tokenizers. For Thai we now use a dictionary-based segmenter. For Japanese a morphological analyzer. For Mandarin a statistical word segmenter. These know how to split a sentence into morphologically reasonable tiles even when there are no spaces.
A per-language minimum tile count check in the blueprint validator. If a word-reorder item generates fewer than three tiles after segmentation, validation fails. This would have caught the bug at the blueprint level, regardless of what the runtime did.
Re-author the affected Thai, Japanese, and Mandarin word-reorder items using the new tokenizers. About 40 items total across the three languages. Each now produces a sensible number of tiles, and the accept list has been regenerated against the new segmentation.
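
To make the first two changes concrete, here is a minimal sketch. It is not the shipped code. It assumes a per-language tokenizer registry (`TOKENIZERS`, hypothetical), uses greedy longest matching over a four-word toy dictionary as the simplest possible example of dictionary-based segmentation, and leaves out Japanese and Mandarin because those need real analyzers. The threshold of three comes from the validator rule above; every name is illustrative.

```python
from typing import Callable, Dict, List, Set

# Toy dictionary for illustration only; a real segmenter ships a full lexicon.
TOY_THAI_DICTIONARY: Set[str] = {"ฉัน", "อยาก", "กลับ", "บ้าน"}


def longest_match_segment(text: str, dictionary: Set[str]) -> List[str]:
    """Greedy longest-match segmentation; unknown characters become one-character tokens."""
    max_len = max((len(word) for word in dictionary), default=1)
    tokens: List[str] = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if candidate in dictionary:
                break
        else:
            candidate = text[i]  # no dictionary word starts here
        tokens.append(candidate)
        i += len(candidate)
    return tokens


# Hypothetical registry: each language names its tokenizer explicitly.
TOKENIZERS: Dict[str, Callable[[str], List[str]]] = {
    "en": str.split,
    "th": lambda s: longest_match_segment(s, TOY_THAI_DICTIONARY),
}


def tokenize(sentence: str, language: str) -> List[str]:
    """Refuse to guess: an unregistered language is an error, not a whitespace split."""
    if language not in TOKENIZERS:
        raise KeyError(f"no tokenizer registered for language {language!r}")
    return TOKENIZERS[language](sentence)


MIN_TILES = 3  # fewer than three tiles fails validation


def validate_tile_count(tiles: List[str], min_tiles: int = MIN_TILES) -> None:
    """Raise ValueError if an item has too few tiles to be a reorder question."""
    if len(tiles) < min_tiles:
        raise ValueError(f"expected at least {min_tiles} tiles, got {len(tiles)}")


tiles = tokenize("ฉันอยากกลับบ้าน", "th")
validate_tile_count(tiles)  # passes: four tiles
print(tiles)  # ['ฉัน', 'อยาก', 'กลับ', 'บ้าน']
```

Raising on an unregistered language is a deliberate choice in this sketch: falling back to a whitespace split is exactly the silent default that caused the bug.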

Shipped in Android v1.1.0 today. The iOS counterpart will ship in the next iOS release window.

What we’re taking away

A few lessons that will outlive this specific fix:

Tokenization isn’t language-neutral, even when your code looks like it doesn’t care. Anywhere we assume whitespace is a word boundary is a place that breaks on Thai, Japanese, Mandarin, Khmer, Lao, and a few others.
Coverage tests catch consistency, not correctness. A blueprint and a scoring engine that agree on a wrong answer are still wrong. We need at least one human reviewer per language who’s reading the blueprint with the eyes of a learner, not just running the validator.
The QA preview tool shouldn’t share code with the production runtime. When both use the same broken assumption, your reviewer can’t see the bug. We’ve split them. The preview tool now renders tiles using a deliberately different path so it would catch a divergence.
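
The last point can also be checked mechanically rather than left to a reviewer's eye. The sketch below is purely illustrative and does not describe our tooling: it assumes two independent tokenization paths (`production_tokenize` and `preview_tokenize`, both stand-ins) and a hypothetical hand-authored reference segmentation, and it reports any sentence on which the two paths disagree.

```python
from typing import Callable, Dict, List, Tuple

Tokenizer = Callable[[str], List[str]]

# Hypothetical hand-authored segmentations used by the independent preview path.
REFERENCE_TILES: Dict[str, List[str]] = {
    "ฉันอยากกลับบ้าน": ["ฉัน", "อยาก", "กลับ", "บ้าน"],
}


def production_tokenize(sentence: str) -> List[str]:
    """Stand-in for a production path that still assumes whitespace boundaries."""
    return sentence.split()


def preview_tokenize(sentence: str) -> List[str]:
    """Stand-in for an independent path: reference tiles if known, else whitespace."""
    return REFERENCE_TILES.get(sentence, sentence.split())


def find_divergences(sentences: List[str],
                     production: Tokenizer,
                     preview: Tokenizer) -> List[Tuple[str, List[str], List[str]]]:
    """Return (sentence, production tiles, preview tiles) wherever the paths disagree."""
    divergences = []
    for sentence in sentences:
        prod_tiles, prev_tiles = production(sentence), preview(sentence)
        if prod_tiles != prev_tiles:
            divergences.append((sentence, prod_tiles, prev_tiles))
    return divergences


report = find_divergences(
    ["I want to go home", "ฉันอยากกลับบ้าน"],
    production_tokenize,
    preview_tokenize,
)
print(len(report))  # 1
```

When both paths share one tokenizer, this report is empty by construction, which is the self-consistency trap described above.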

Sorry to anyone whose Thai, Japanese, or Mandarin word-reorder results were trivially correct over the last several weeks. The good news is that those items didn’t weigh heavily in your overall CEFR estimate (word-reorder is one production format among several). The fix is live on Android now, and iOS follows in the next release window.