Why we threw out our QA harness and rebuilt from scratch

This is the kind of post you write the day after you do something you wish you hadn’t put off as long as you did.

We spent yesterday deleting the test-content QA infrastructure that had grown up around our blueprints over the last two months. Replaced every blueprint with a placeholder. Archived the lot to a long-form notes branch. Today we’re restarting the content pass with a clean slate.

This post is about why.

What we had

The original QA system was a Python toolkit that ran a battery of checks over every blueprint:

Schema validation (the easy part).
Per-item answer validation: run the production scoring engine against the blueprint’s authored answer, flag if the engine doesn’t accept it.
Per-language consistency checks: do the CEFR weights sum to the documented maximum possible score, and do the cut scores cover the score range without gaps?
Distractor analysis: look at multiple-choice items and flag ones where one option was obviously trap-y, where two were near-duplicates, where the correct answer was the shortest or the longest by a suspicious margin.
A “phonetic trap” detector for listening items that flagged any minimal pairs we hadn’t deliberately authored as such.
Auto-fixes for a handful of common issues, run as part of the same pipeline.
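
To make the per-language consistency checks concrete, here is a minimal sketch of the two questions they asked. The data shapes are assumptions made for illustration (a dict of CEFR level to points, and inclusive integer score bands); the function names are hypothetical and this is not the toolkit's actual code.

```python
from typing import Dict, List, Tuple

# Hypothetical shape: (CEFR level, lowest score, highest score), both inclusive.
Band = Tuple[str, int, int]


def check_weights(weights: Dict[str, int], documented_max: int) -> List[str]:
    """Flag a mismatch between the summed weights and the documented maximum."""
    total = sum(weights.values())
    if total != documented_max:
        return [f"weights sum to {total}, documented max is {documented_max}"]
    return []


def check_cut_scores(bands: List[Band], max_score: int) -> List[str]:
    """Flag gaps and overlaps in the bands covering scores 0..max_score."""
    problems: List[str] = []
    expected_low = 0  # next score that still needs a band
    for level, low, high in sorted(bands, key=lambda b: b[1]):
        if low > expected_low:
            problems.append(f"gap: scores {expected_low}-{low - 1} map to no level")
        elif low < expected_low:
            problems.append(f"overlap: {level} starts at {low}, expected {expected_low}")
        expected_low = max(expected_low, high + 1)
    if expected_low <= max_score:
        problems.append(f"gap: scores {expected_low}-{max_score} map to no level")
    return problems
```

Both functions return a list of messages rather than raising, so a report can collect every problem for a language in one pass.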

It was honestly a really nice system. The problem was that we kept making it nicer.

How it went wrong

The QA toolkit was easier to extend than the blueprints were. Adding a new check was a few lines of Python. Fixing the things the check flagged across 37 languages was hours of manual work. So the queue grew.

By month two, the toolkit had 14 different checks, most with at least one false positive per language. The reports were so noisy we’d effectively stopped reading them. We had a script that applied the auto-fixes automatically, and the auto-fixes had their own subtle bugs that we then patched around with more checks. The thing that was supposed to ensure quality had become a thing we were maintaining instead of using.

The real signal that something was wrong: when a content question came up (“is this Hungarian B2 item actually B2?”), nobody on the team thought to run the QA toolkit. We just opened the blueprint and read it.

The decision

We could have kept patching. The right call was to stop, archive the work for reference, and rebuild what we actually needed.

What we actually need:

Schema validation. Cheap, useful, mandatory. The new pipeline runs this on every commit.
A blueprint-coverage test suite. Drive every blueprint through the production scoring engine with every authored answer, and assert that the engine accepts what the blueprint says is correct. This is the only test we trust automatically, because it uses the production code path, not a parallel re-implementation.
Human review for everything else. Distractor quality, level appropriateness, listening transcript accuracy. These aren’t checks we’re willing to automate away again.
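
As a rough illustration of the coverage suite's shape, the sketch below walks every authored answer in every blueprint and records the ones the engine rejects. The blueprint fields (`id`, `items`, `authored_answers`, `accepted`) and the `fake_engine_accepts` stand-in are assumptions so the example runs on its own; the real suite would pass in the production scoring engine's entry point instead.

```python
from typing import Any, Callable, Dict, Iterable, List, Tuple

Item = Dict[str, Any]
Blueprint = Dict[str, Any]
Failure = Tuple[str, str, str]  # (blueprint id, item id, rejected answer)


def coverage_failures(
    blueprints: Iterable[Blueprint],
    accepts: Callable[[Item, str], bool],
) -> List[Failure]:
    """Return every authored answer the scoring engine does not accept."""
    failures: List[Failure] = []
    for blueprint in blueprints:
        for item in blueprint["items"]:
            for answer in item["authored_answers"]:
                if not accepts(item, answer):
                    failures.append((blueprint["id"], item["id"], answer))
    return failures


def fake_engine_accepts(item: Item, answer: str) -> bool:
    """Stand-in for the production engine, used only to make this sketch runnable."""
    return answer.strip().lower() in item["accepted"]


# Hypothetical blueprint: the second item's authored answer is not accepted.
example = {
    "id": "hu-b2",
    "items": [
        {"id": "q1", "authored_answers": ["Igen"], "accepted": {"igen"}},
        {"id": "q2", "authored_answers": ["nem"], "accepted": {"igen"}},
    ],
}
```

Taking the engine as a parameter is what keeps the suite on the production code path: the test owns the loop, not the scoring logic. The gate is then a single assertion that `coverage_failures(...)` returns an empty list.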

The placeholder blueprints we put in place today are minimal scaffolds that pass schema validation but contain no real content. We’re rebuilding from the language-data side, language by language, with the new coverage suite as the only automated gate.

What we learned

Three things worth writing down so we don’t repeat them:

An automated check is only worth keeping if you actually look at its output. If you have to scroll past 200 false positives to find the real one, the check is doing harm.
Auto-fixers should be separate processes from auto-detectors. Running them together means a false-positive detection turns into a wrong-fix commit. Keeping them split would have made the damage recoverable.
The temptation to add another check is much stronger than the temptation to delete an old one. We should have been deleting at the same rate we were adding from week one. We weren’t.
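
One way to picture the detector/fixer split from the second point: the detector only reports findings, and a separate step applies a fix only for findings someone has approved. This is a minimal sketch under assumed names; the trailing-whitespace check is a made-up example, not one of the toolkit's actual checks.

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass(frozen=True)
class Finding:
    item_id: str
    check: str
    proposed: str  # replacement text the detector suggests


def detect(items: Dict[str, str]) -> List[Finding]:
    """Read-only pass: report items with trailing whitespace, change nothing."""
    return [
        Finding(item_id, "trailing-whitespace", text.rstrip())
        for item_id, text in items.items()
        if text != text.rstrip()
    ]


def apply_fixes(
    items: Dict[str, str],
    findings: List[Finding],
    approved_ids: Set[str],
) -> Dict[str, str]:
    """Separate step: apply only approved findings, on a copy of the input."""
    fixed = dict(items)
    for finding in findings:
        if finding.item_id in approved_ids:
            fixed[finding.item_id] = finding.proposed
    return fixed
```

With this split, a false positive from `detect` stays a line in a report until a person approves it, so it cannot become a wrong-fix commit on its own.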

This rebuild costs us some perceived progress in the short term. We expect to make it back fast. The next stretch of content work won’t be slowed down by the toolkit that was supposed to be helping.