A reproducible study of synthetic-text detection and model collapse. Six interpretable, dependency-free statistics generalize to unseen language models, yet per-document detection turns out to be a fundamentally different objective from distribution-level collapse monitoring.
A detector can correctly identify synthetic documents while completely failing to detect distributional degeneration.
Flagging an AI document and warning that a corpus is collapsing sound like jobs for the same tool. They are not, and different signals answer each question.
| Signal / metric | Flag one AI document | Detect corpus collapse |
|---|---|---|
| Perplexity (fixed LM) | Good | Good |
| Per-document detector score | Good | Poor |
| Vocabulary size | Poor | Excellent |
| Distinct n-gram diversity | Poor | Excellent |
The detector is a deliberately simple instrument — the SyntheticTextProbe — not the contribution. "Habsburg AI" names the phenomenon we study (model collapse), after Shumailov et al. (2023).
Each of the six signals is normalized to 0–1 and combined with a fixed weight into a 0–100 synthetic-likelihood score. Three are inverted so that a higher contribution always means more synthetic. The implementation uses only the Python standard library.
Canned filler phrases ("it is important to note…") per ~50 tokens.
Fraction of trigrams that are repeats — looping phrasing.
Type-token ratio. Low diversity → more synthetic.
Distinct-trigram ratio. Low → more synthetic.
Sentence-length variation. Uniform lengths → more synthetic.
Overlap with a corpus of known model outputs — catches recycling.
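As a rough illustration of how signals like these can be computed and combined, here is a minimal standard-library sketch. It is not the SyntheticTextProbe code. The tokenizer, the exact definition of "repeated trigram", the use of a coefficient of variation for sentence lengths, and the `combine` helper with its `weights` and `inverted` arguments are all assumptions made for this example. The filler-phrase rate and the overlap signal are omitted because they need a phrase list and a reference corpus that are not given here.

```python
import re
import statistics
from collections import Counter


def tokens(text):
    """Lowercase word tokens from a simple regex tokenizer (an assumption)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def trigrams(toks):
    """All consecutive token triples."""
    return [tuple(toks[i:i + 3]) for i in range(len(toks) - 2)]


def type_token_ratio(toks):
    """Unique tokens divided by total tokens."""
    return len(set(toks)) / len(toks) if toks else 0.0


def distinct_trigram_ratio(toks):
    """Unique trigrams divided by total trigrams."""
    grams = trigrams(toks)
    return len(set(grams)) / len(grams) if grams else 0.0


def repeated_trigram_fraction(toks):
    """Fraction of trigram occurrences whose trigram appears more than once."""
    grams = trigrams(toks)
    if not grams:
        return 0.0
    counts = Counter(grams)
    return sum(c for c in counts.values() if c > 1) / len(grams)


def sentence_length_cv(text):
    """Coefficient of variation of sentence lengths, measured in tokens."""
    lengths = [len(tokens(s)) for s in re.split(r"[.!?]+", text) if tokens(s)]
    if len(lengths) < 2:
        return 0.0
    return statistics.pstdev(lengths) / statistics.mean(lengths)


def combine(signals, weights, inverted):
    """Weighted average of 0-1 signals, scaled to 0-100.

    Signals named in `inverted` contribute (1 - value), so a higher
    contribution always means "more synthetic". Weights are hypothetical.
    """
    total = sum(weights[name] for name in signals)
    score = 0.0
    for name, value in signals.items():
        value = min(max(value, 0.0), 1.0)  # clamp to the 0-1 range
        contribution = 1.0 - value if name in inverted else value
        score += weights[name] * contribution
    return 100.0 * score / total
```

Going by the descriptions above, the three inverted signals would be the type-token ratio, the distinct-trigram ratio, and sentence-length variation, since low values of each are described as more synthetic.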
A faithful JavaScript port of the detector's six signals, running entirely in your browser. Paste text or load a preset.
| Held-out | Classifier | Perplexity | Zero-shot |
|---|---|---|---|
| Mistral-chat | 0.981 | 0.965 | 0.784 |
| ChatGPT | 0.976 | 0.968 | 0.848 |
| GPT-4 | 0.907 | 0.868 | 0.734 |
| Cohere | 0.837 | 0.862 | 0.636 |
| GPT-2 | 0.834 | 0.840 | 0.594 |
| Mean | 0.924 | 0.911 | 0.732 |
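Across 10 seeds, the classifier averages 0.915 ± 0.006 (95% CI [0.911, 0.920]) versus 0.903 ± 0.007 for perplexity, with non-overlapping CIs. A paired permutation test gives a mean difference of +0.013, with the classifier winning on 10/10 seeds (p = 0.002).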
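For readers unfamiliar with the test, the sketch below shows one common form of paired permutation test: an exact two-sided sign-flip test on per-seed differences. Whether the study used exactly this variant is an assumption, and the function name is hypothetical. One property is worth noting: with 10 seeds there are 2^10 = 1,024 sign assignments, so when one method wins on every seed the smallest achievable two-sided p-value is 2/1,024 ≈ 0.002.

```python
from itertools import product


def paired_permutation_p(diffs):
    """Exact two-sided sign-flip permutation test on paired differences.

    Under the null hypothesis the sign of each per-seed difference is
    arbitrary, so every one of the 2**n sign assignments is enumerated
    and the mean difference is recomputed for each.
    """
    n = len(diffs)
    observed = abs(sum(diffs) / n)
    hits = 0
    for signs in product((1, -1), repeat=n):
        stat = abs(sum(s * d for s, d in zip(signs, diffs)) / n)
        if stat >= observed - 1e-12:  # small tolerance for float rounding
            hits += 1
    return hits / 2 ** n
```

Calling `paired_permutation_p` on any ten strictly positive differences returns 2/1,024, because only the all-positive and all-negative sign assignments match the observed magnitude.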


| Detector (zero-shot) | HC3 | RAID |
|---|---|---|
| SyntheticTextProbe (heuristics) | 0.915 | 0.724 |
| GLTR — GPT-2 log-prob | 0.995 | 0.862 |
| GLTR — GPT-2 top-10 rank | 0.995 | 0.848 |
A white-box likelihood detector clearly wins for detection — but it is also per-document, so it is equally blind to collapse. The gap isn't an artifact of a weak detector.
We retrain a model on the text it just generated, generation after generation, and instrument it with the same detector and diversity metrics.
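The loop can be illustrated with a toy stand-in. The sketch below replaces the language model with a unigram sampler that is refit on its own output each generation, and tracks two corpus-level metrics. It is an assumption-laden illustration, not the pipeline used in the study: the Zipf-like seed corpus, the function names, and the corpus dimensions are all hypothetical. Because each generation can only emit words seen in the previous one, vocabulary size can never grow, which makes the direction of the drift easy to see.

```python
import random
from collections import Counter


def vocab_size(corpus):
    """Number of distinct tokens across the whole corpus."""
    return len({tok for doc in corpus for tok in doc})


def distinct_ngram_ratio(corpus, n=3):
    """Unique n-grams divided by total n-grams, pooled over the corpus."""
    grams = [tuple(doc[i:i + n]) for doc in corpus
             for i in range(len(doc) - n + 1)]
    return len(set(grams)) / len(grams) if grams else 0.0


def make_seed_corpus(rng, vocab=500, n_docs=50, doc_len=40):
    """Hypothetical starting corpus with Zipf-like word frequencies."""
    words = [f"w{i}" for i in range(vocab)]
    weights = [1.0 / (rank + 1) for rank in range(vocab)]
    return [rng.choices(words, weights=weights, k=doc_len)
            for _ in range(n_docs)]


def fit_unigram(corpus):
    """'Train' the toy model: count how often each token occurs."""
    counts = Counter(tok for doc in corpus for tok in doc)
    words = list(counts)
    return words, [counts[w] for w in words]


def generate(model, n_docs, doc_len, rng):
    """Sample a new corpus from the fitted unigram distribution."""
    words, weights = model
    return [rng.choices(words, weights=weights, k=doc_len)
            for _ in range(n_docs)]


def simulate(corpus, generations, rng):
    """Refit on the model's own output and record metrics per generation."""
    history = [(vocab_size(corpus), distinct_ngram_ratio(corpus))]
    for _ in range(generations):
        model = fit_unigram(corpus)
        corpus = generate(model, len(corpus), len(corpus[0]), rng)
        history.append((vocab_size(corpus), distinct_ngram_ratio(corpus)))
    return history


if __name__ == "__main__":
    rng = random.Random(0)
    for gen, (vocab, distinct) in enumerate(simulate(make_seed_corpus(rng), 10, rng)):
        print(f"generation {gen}: vocab={vocab} distinct-3={distinct:.3f}")
```

A unigram sampler has no notion of fluency, so this toy says nothing about per-document detector scores; it only shows how corpus-level diversity can erode under recursive training.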

We scaled the corpus to 150 / 1,500 / 15,000 docs on two datasets. Larger corpora collapse more slowly (vocab loss −96% → −32%), but in all six runs the distribution metrics still fall and the detector stays flat, with contamination at 0% everywhere.
The effect is not limited to 2019-era distilGPT-2. With Qwen2.5-0.5B (2024), LoRA-fine-tuned recursively, reference perplexity falls from 37 to 2.9 and vocabulary shrinks by 87%, yet the detector score stays flat (17.6 → 18.4), with contamination at 2%.
A 7-feature classifier reaches 0.915 on generators it never trained on, beating heuristics and perplexity alone.
One fixed detector ranges 0.57–0.83: strong on chat models, near chance on base GPT-2.
Predicted it would raise the floor; it didn't (0.895 vs 0.924). Reported with its caveat, not spun away.
Collapse is real (perplexity 69→7) but per-document detection barely moves — a different problem entirely.


A 17-page write-up (PAPER.pdf): detector, baselines, LOMO classifier, ablation, GLTR comparison, scaling, collapse, and a candid threats-to-validity section.