AI-text detection · model collapse · 8 generators

Contamination detection is not collapse monitoring.

A reproducible study of synthetic-text detection and model collapse. Six interpretable, dependency-free statistics generalize to unseen language models, yet per-document detection turns out to be a fundamentally different objective from distribution-level collapse monitoring.

A detector can correctly identify synthetic documents while completely failing to detect distributional degeneration.

0.915: mean AUC on unseen generators (10 seeds, p = 0.002 vs perplexity)
8: real LLMs evaluated (GPT-2/3/4, ChatGPT, Llama, Mistral, Cohere, MPT)
69 → 7: reference-perplexity collapse of a model retrained on its own output
100%: stdlib core, reproducible from a single fixed seed

The central claim

Two questions that get conflated

Flagging an AI document and warning that a corpus is collapsing sound like the same tool. They are not — and different signals answer them.

AI-Text Detection
operates on ONE document: “is this text synthetic?”
Collapse Monitoring
operates on the DISTRIBUTION: “is the corpus losing diversity?”
Signal / metric             | Flag one AI document | Detect corpus collapse
Perplexity (fixed LM)       | Good                 | Good
Per-document detector score | Good                 | Poor
Vocabulary size             | Poor                 | Excellent
Distinct n-gram diversity   | Poor                 | Excellent
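
To make the two distribution-level rows concrete, here is a minimal sketch that computes vocabulary size and distinct-trigram ratio over a whole corpus rather than over one document. The tokenizer, the function names, and the two toy corpora are illustrative assumptions, not the project's code.

```python
import re

def tokens(text):
    # Lowercased word tokens; this tokenization is an assumption.
    return re.findall(r"[a-z0-9']+", text.lower())

def corpus_vocab_size(docs):
    # Number of distinct word types across the whole corpus.
    return len({t for d in docs for t in tokens(d)})

def corpus_distinct_ngram_ratio(docs, n=3):
    # Distinct n-grams divided by total n-grams, pooled over all documents.
    grams = []
    for d in docs:
        toks = tokens(d)
        grams.extend(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))
    return len(set(grams)) / len(grams) if grams else 0.0

# Toy corpora (hypothetical): every document in `collapsed` looks fine on its
# own, but the corpus as a whole has lost its diversity.
healthy = ["the cat sat on the mat", "a dog ran across the wet field"]
collapsed = ["the cat sat on the mat", "the cat sat on the mat"]

print(corpus_vocab_size(healthy), corpus_vocab_size(collapsed))
print(corpus_distinct_ngram_ratio(healthy), corpus_distinct_ngram_ratio(collapsed))
```

Neither metric can be read off a single document, which is why a per-document score has nothing to say about them.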

The detector is a deliberately simple instrument — the SyntheticTextProbe — not the contribution. "Habsburg AI" names the phenomenon we study (model collapse), after Shumailov et al. (2023).

The instrument

Six interpretable signals

Each signal is normalized to 0–1 and combined with a fixed weight (w) into a 0–100 synthetic-likelihood score. Three signals (lexical diversity, n-gram diversity, burstiness) are inverted so that a higher contribution always means more synthetic. Pure Python standard library.
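
A minimal sketch of that weighted combination, using the weights listed with each signal in this section. The dictionary keys and the function name are hypothetical, and clamping each input to 0–1 is an assumption.

```python
# Weights as listed for the six signals; the key names are illustrative.
WEIGHTS = {
    "filler_density": 0.20,
    "repetition": 0.18,
    "lexical_diversity": 0.16,
    "ngram_diversity": 0.16,
    "burstiness": 0.15,
    "duplicate_similarity": 0.15,
}

# Signals where a LOW value means "more synthetic", so they are flipped.
INVERTED = {"lexical_diversity", "ngram_diversity", "burstiness"}

def synthetic_likelihood(signals):
    """Combine six normalized signals (each 0-1) into a 0-100 score."""
    total = 0.0
    for name, weight in WEIGHTS.items():
        value = min(1.0, max(0.0, signals[name]))  # clamp (assumption)
        if name in INVERTED:
            value = 1.0 - value
        total += weight * value
    return 100.0 * total
```

Because the weights sum to 1, a document that maxes out every contribution scores 100 and one that zeroes every contribution scores 0.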

w 0.20

Filler density

Canned filler phrases ("it is important to note…") per ~50 tokens.
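
One way this signal could be computed, as a sketch only. Of the phrase list, only "it is important to note" comes from the text above; the other entries are hypothetical examples. Saturating the signal at one phrase per 50 tokens is also an assumption.

```python
import re

# Only the first phrase appears in the description; the rest are hypothetical.
FILLER_PHRASES = (
    "it is important to note",
    "in conclusion",
    "as an ai language model",
)

def filler_density(text, per_tokens=50):
    lowered = text.lower()
    n_tokens = len(re.findall(r"[a-z0-9']+", lowered))
    if n_tokens == 0:
        return 0.0
    hits = sum(lowered.count(phrase) for phrase in FILLER_PHRASES)
    # One filler phrase per `per_tokens` tokens saturates the signal at 1.0.
    return min(1.0, hits * per_tokens / n_tokens)
```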

w 0.18

Repetition

Fraction of trigrams that are repeats — looping phrasing.
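
A minimal sketch of this fraction, assuming lowercased word tokens and counting every occurrence of a trigram after its first as a repeat. The function name is illustrative.

```python
import re
from collections import Counter

def repetition(text, n=3):
    toks = re.findall(r"[a-z0-9']+", text.lower())
    grams = [tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)]
    if not grams:
        return 0.0
    counts = Counter(grams)
    # Every occurrence beyond the first of a given n-gram is a repeat.
    repeats = sum(c - 1 for c in counts.values())
    return repeats / len(grams)
```

For example, "a b c a b c" has four trigrams, one of which repeats an earlier one, so the signal is 0.25.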

w 0.16

Lexical diversity

Type-token ratio. Low diversity → more synthetic.

w 0.16

N-gram diversity

Distinct-trigram ratio. Low → more synthetic.
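
The two diversity ratios can be sketched in a few lines. The tokenizer and function names are assumptions, and both values are raw ratios before the inversion that turns low diversity into a high synthetic contribution.

```python
import re

def _tokens(text):
    # Lowercased word tokens; this tokenization is an assumption.
    return re.findall(r"[a-z0-9']+", text.lower())

def type_token_ratio(text):
    # Distinct word types divided by total tokens.
    toks = _tokens(text)
    return len(set(toks)) / len(toks) if toks else 0.0

def distinct_ngram_ratio(text, n=3):
    # Distinct n-grams divided by total n-grams.
    toks = _tokens(text)
    grams = [tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)]
    return len(set(grams)) / len(grams) if grams else 0.0
```

Type-token ratio is known to fall as documents get longer, so a real implementation would likely need some length control; this sketch does not attempt one.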

w 0.15

Burstiness

Sentence-length variation. Uniform lengths → more synthetic.
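
A sketch of one plausible burstiness measure: the coefficient of variation of sentence lengths, capped at 1. Using the coefficient of variation, the punctuation-based sentence splitter, and the cap are all assumptions; the description only says that the signal measures sentence-length variation.

```python
import re
from statistics import mean, pstdev

def burstiness(text):
    # Naive sentence split on ., ! and ? (an assumption).
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    lengths = [len(s.split()) for s in sentences]
    if len(lengths) < 2:
        return 0.0
    # Coefficient of variation: standard deviation relative to the mean length.
    cv = pstdev(lengths) / mean(lengths)
    return min(1.0, cv)
```

Uniform sentence lengths give 0.0 here, which becomes a high synthetic contribution once the signal is inverted.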

w 0.15

Duplicate similarity

Overlap with a corpus of known model outputs — catches recycling.
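
A sketch of how such an overlap could be measured, assuming Jaccard similarity over word-trigram shingles and taking the best match against the known outputs. The similarity measure and the names are assumptions, not a statement of how the project computes it.

```python
import re

def shingles(text, n=3):
    # Set of word n-grams for a text.
    toks = re.findall(r"[a-z0-9']+", text.lower())
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def duplicate_similarity(text, known_outputs, n=3):
    target = shingles(text, n)
    if not target:
        return 0.0
    best = 0.0
    for other in known_outputs:
        ref = shingles(other, n)
        union = target | ref
        if union:
            # Jaccard similarity: shared shingles over all shingles.
            best = max(best, len(target & ref) / len(union))
    return best
```

Unlike the other five signals, this one needs an external reference corpus, so its value depends on what has been collected as known model output.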

Live demo

Score your own text

A faithful JavaScript port of the detector's six signals, running entirely in your browser. Paste text or load a preset.

Results · real data

Detection on real human-vs-LLM text

The headline — generalizing to unseen generators

Leave-one-model-out AUC on the held-out generator (reference seed 1234).
Held-out     | Classifier | Perplexity | Zero-shot
Mistral-chat | 0.981      | 0.965      | 0.784
ChatGPT      | 0.976      | 0.968      | 0.848
GPT-4        | 0.907      | 0.868      | 0.734
Cohere       | 0.837      | 0.862      | 0.636
GPT-2        | 0.834      | 0.840      | 0.594
Mean         | 0.924      | 0.911      | 0.732
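
For readers unfamiliar with the protocol, here is a sketch of the two pieces behind a leave-one-model-out AUC: a rank-based AUC and a split that withholds one generator at a time. The data layout (a dict from generator name to its documents) and the function names are assumptions, and how human documents are divided between training and test is left to the caller.

```python
def auc(scores, labels):
    """Probability that a random positive outscores a random negative.

    Ties count as half a win. Labels are 1 (synthetic) or 0 (human).
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one example of each class")
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def lomo_splits(machine_docs):
    """Yield (held_out, train_docs, test_docs), one split per generator.

    `machine_docs` maps a generator name to its list of documents.
    The held-out generator never contributes to the training set.
    """
    for held_out in sorted(machine_docs):
        train = [d for g, docs in machine_docs.items()
                 if g != held_out for d in docs]
        yield held_out, train, list(machine_docs[held_out])
```

The pairwise AUC is quadratic in the number of documents, which is fine for a sketch but slow on large test sets.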

Significant across 10 seeds

Mean 0.915 ± 0.006 (95% CI [0.911, 0.920]) vs perplexity 0.903 ± 0.007 — non-overlapping CIs. Paired permutation test: +0.013, wins 10/10 seeds, p = 0.002.
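
A note on the p-value: if the paired test is a two-sided sign-flip permutation test (an assumption about the exact procedure), then with 10 seeds there are 2^10 = 1024 sign patterns, and winning on all 10 seeds gives the smallest attainable p, 2/1024 ≈ 0.002. The sketch below enumerates the patterns exactly. The per-seed differences are placeholders, not the study's values.

```python
from itertools import product

def paired_permutation_p(diffs):
    """Exact two-sided sign-flip permutation test on paired differences."""
    observed = abs(sum(diffs))
    n_extreme = 0
    n_total = 0
    for signs in product((1, -1), repeat=len(diffs)):
        n_total += 1
        stat = abs(sum(s * d for s, d in zip(signs, diffs)))
        # Small tolerance guards against floating-point noise in the comparison.
        if stat >= observed - 1e-12:
            n_extreme += 1
    return n_extreme / n_total

# Placeholder per-seed AUC differences (hypothetical, all positive).
placeholder_diffs = [0.010 + 0.001 * i for i in range(10)]
p_value = paired_permutation_p(placeholder_diffs)
print(round(p_value, 3))
```

Any set of 10 strictly positive differences gives the same p, so under this assumption more seeds would be needed to report a smaller value.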

Figure: Leave-one-model-out AUC. The trained classifier generalizes to unseen generators, beating perplexity alone and the zero-shot heuristic.

An honest comparison — GLTR beats the heuristics

Figure: Cross-model AUC. One fixed zero-shot detector vs eight generators: strong on chat models, near chance on GPT-2 base completions.
Zero-shot AUC. DetectGPT-curvature & GPTZero aren't run (white-box / paid API).
Detector (zero-shot)            | HC3   | RAID
SyntheticTextProbe (heuristics) | 0.915 | 0.724
GLTR (GPT-2 log-prob)           | 0.995 | 0.862
GLTR (GPT-2 top-10 rank)        | 0.995 | 0.848

This strengthens the thesis

A white-box likelihood detector clearly wins for detection — but it is also per-document, so it is equally blind to collapse. The gap isn't an artifact of a weak detector.

Model collapse

Watching a model eat its own output

We retrain a model on the text it just generated, generation after generation, and instrument it with the same detector and diversity metrics.
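
The loop can be illustrated with a deliberately tiny stand-in. The sketch below swaps the neural model for a unigram frequency table ("training" is counting, "generation" is sampling), so it shows only the mechanism of recursive retraining, not the project's distilGPT-2 pipeline. The Zipf-like seed corpus and all names are hypothetical.

```python
import random
from collections import Counter

def retrain_and_sample(corpus_tokens, n_tokens, rng):
    # "Train": count unigram frequencies. "Generate": sample from them.
    counts = Counter(corpus_tokens)
    vocab = sorted(counts)
    weights = [counts[w] for w in vocab]
    return rng.choices(vocab, weights=weights, k=n_tokens)

def simulate_collapse(seed_tokens, generations=10, seed=1234):
    rng = random.Random(seed)
    corpus = list(seed_tokens)
    vocab_sizes = [len(set(corpus))]
    for _ in range(generations):
        # Each generation is trained only on the previous generation's output.
        corpus = retrain_and_sample(corpus, len(corpus), rng)
        vocab_sizes.append(len(set(corpus)))
    return vocab_sizes

# Hypothetical Zipf-like seed corpus: word i appears about 200 / i times.
seed_corpus = [f"w{i}" for i in range(1, 201) for _ in range(200 // i)]
sizes = simulate_collapse(seed_corpus)
print(sizes)
```

In this toy, a word that fails to be sampled in one generation can never return, so vocabulary size can only shrink. That is the distribution-level loss the diversity metrics are meant to track.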

Figure: Neural collapse curves (distilGPT-2). Left: diversity falls and the detector creeps up. Right: reference perplexity crashes 69 → 7, the cleanest collapse signal in the project.
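
Why would reference perplexity fall during collapse? A fixed reference model assigns high probability to common tokens, so text that concentrates on those tokens looks increasingly predictable. The sketch below illustrates this with an add-one-smoothed unigram reference model, a toy substitute for the fixed language model used in the study. All names and the example strings are hypothetical.

```python
import math
import re
from collections import Counter

def tokens(text):
    return re.findall(r"[a-z0-9']+", text.lower())

def fit_reference(text):
    """Return a token -> probability function, fixed after fitting."""
    counts = Counter(tokens(text))
    total = sum(counts.values())
    vocab = len(counts) + 1  # one extra slot shared by all unseen tokens
    # Add-one smoothing keeps unseen tokens at non-zero probability.
    return lambda tok: (counts[tok] + 1) / (total + vocab)

def perplexity(text, prob):
    toks = tokens(text)
    if not toks:
        raise ValueError("cannot compute perplexity of empty text")
    log_sum = sum(math.log(prob(t)) for t in toks)
    # Perplexity is the exponential of the average negative log-probability.
    return math.exp(-log_sum / len(toks))

prob = fit_reference("the the the the cat dog bird fish")
diverse = perplexity("cat dog bird fish", prob)
collapsed = perplexity("the the the the", prob)
print(diverse, collapsed)
```

The collapsed string scores a lower perplexity than the diverse one under the same reference model, the same direction as the 69 → 7 drop reported above.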

Holds at 100× scale

We scaled the corpus to 150, 1,500, and 15,000 documents on two datasets. Larger corpora collapse more slowly (vocabulary loss shrinks from −96% to −32%), but in all six runs the distribution metrics still fall and the detector stays flat, with contamination at 0% in every run.

Holds for a modern LLM

The effect is not limited to 2019-era distilGPT-2. With Qwen2.5-0.5B (2024), LoRA-fine-tuned recursively, reference perplexity falls 37 → 2.9 and vocabulary shrinks by 87%, yet the detector stays flat (17.6 → 18.4) with contamination at 2%.

Honest findings

What's true — including where the prediction was wrong

✓ Cheap features generalize

A 7-feature classifier reaches 0.915 on generators it never trained on, beating heuristics and perplexity alone.

~ Zero-shot transfer is uneven

One fixed detector ranges 0.57–0.83: strong on chat models, near chance on base GPT-2.

✗ GPT-2 perplexity didn't win

Predicted it would raise the floor; it didn't (0.895 vs 0.924). Reported with its caveat, not spun away.

✗ The detector misses collapse

Collapse is real (perplexity 69→7) but per-document detection barely moves — a different problem entirely.

Figures

Gallery

Figure: ROC on HC3 (real human vs ChatGPT).
Figure: Collapse at scale. Diversity falls at every size; detector stays flat.

Paper & reproducibility

Read it, run it

The paper

A 17-page write-up: detector, baselines, LOMO classifier, ablation, GLTR comparison, scaling, collapse, and a candid threats-to-validity section.

Download PAPER.pdf

Reproducible by design

  • Core runs on the Python standard library — zero install.
  • Real corpora fetched over HTTP (HC3, RAID).
  • Fixed seed (1234); one isolated venv only for GPT-2 / neural collapse.
detector.py · lomo.py · crossmodel.py · collapse_modern.py · gltr_baseline.py