Contamination Detection Is Not Collapse Monitoring

The central claim

Two questions that get conflated

Flagging an AI-generated document and warning that a corpus is collapsing sound like jobs for the same tool. They are not, and different signals answer them.

AI-Text Detection

operates on ONE document → “is this text synthetic?”

Collapse Monitoring

operates on the DISTRIBUTION → “is the corpus losing diversity?”

Signal / metric	Flag one AI document	Detect corpus collapse
Perplexity (fixed LM)	Good	Good
Per-document detector score	Good	Poor
Vocabulary size	Poor	Excellent
Distinct n-gram diversity	Poor	Excellent

The detector is a deliberately simple instrument — the SyntheticTextProbe — not the contribution. "Habsburg AI" names the phenomenon we study (model collapse), after Shumailov et al. (2023).
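The two distribution-level metrics in the table, vocabulary size and distinct n-gram diversity, are computed over the whole corpus rather than per document. A minimal sketch, assuming a lowercase whitespace tokenizer and trigram windows that do not cross document boundaries (both are assumptions; `tokenize` and `corpus_diversity` are illustrative names, not the project's API):

```python
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer (an assumption, not the project's tokenizer)."""
    return text.lower().split()


def corpus_diversity(docs, n=3):
    """Return (vocabulary size, distinct n-gram ratio) pooled over all documents."""
    vocab = set()
    ngrams = Counter()
    for doc in docs:
        tokens = tokenize(doc)
        vocab.update(tokens)
        # n-gram windows stay inside one document; short documents contribute none.
        ngrams.update(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    total = sum(ngrams.values())
    distinct_ratio = len(ngrams) / total if total else 0.0
    return len(vocab), distinct_ratio


# Toy corpora for illustration only.
healthy = ["the cat sat on the mat", "a dog chased seven red balloons uphill"]
collapsed = ["the cat sat on the mat", "the cat sat on the mat"]

print(corpus_diversity(healthy))    # (12, 1.0)
print(corpus_diversity(collapsed))  # (5, 0.5)
```

Both documents in the `collapsed` toy corpus are identical, so a per-document score would be the same for each of them; only the pooled metrics register the loss of diversity.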

The instrument

Six interpretable signals

Each signal is normalized to 0–1 and combined with a fixed weight into a 0–100 synthetic-likelihood score. Three are inverted so that a higher contribution always means more synthetic. Pure Python standard library.

w 0.20

Filler density

Canned filler phrases ("it is important to note…") per ~50 tokens.

w 0.18

Repetition

Fraction of trigrams that are repeats — looping phrasing.

w 0.16

Lexical diversity

Type-token ratio. Low diversity → more synthetic.

w 0.16

N-gram diversity

Distinct-trigram ratio. Low → more synthetic.

w 0.15

Burstiness

Sentence-length variation. Uniform lengths → more synthetic.

w 0.15

Duplicate similarity

Overlap with a corpus of known model outputs — catches recycling.
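The weighting above can be sketched as a plain weighted sum. This is a minimal illustration, not the project's `detector.py`: it assumes each signal has already been normalized to 0–1, that "inverted" means `1 - value`, and that the three inverted signals are the ones whose cards say a low value is more synthetic. The dictionary keys are hypothetical names.

```python
# Weights as listed on the six signal cards; they sum to 1.00.
WEIGHTS = {
    "filler_density": 0.20,
    "repetition": 0.18,
    "lexical_diversity": 0.16,
    "ngram_diversity": 0.16,
    "burstiness": 0.15,
    "duplicate_similarity": 0.15,
}

# Signals where a LOW raw value indicates synthetic text (assumed: inverted as 1 - x).
INVERTED = {"lexical_diversity", "ngram_diversity", "burstiness"}


def synthetic_likelihood(signals):
    """Combine six normalized signals into a 0-100 score."""
    score = 0.0
    for name, weight in WEIGHTS.items():
        value = min(1.0, max(0.0, signals[name]))  # clamp to [0, 1]
        if name in INVERTED:
            value = 1.0 - value
        score += weight * value
    return 100.0 * score


# Hypothetical extreme inputs, for illustration only.
most_synthetic = {
    "filler_density": 1.0, "repetition": 1.0, "duplicate_similarity": 1.0,
    "lexical_diversity": 0.0, "ngram_diversity": 0.0, "burstiness": 0.0,
}
most_human = {name: 1.0 - value for name, value in most_synthetic.items()}

print(round(synthetic_likelihood(most_synthetic), 1))  # 100.0
print(round(synthetic_likelihood(most_human), 1))      # 0.0
```

Because the weights sum to 1, the score stays within 0–100 without any further rescaling.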

Live demo

Score your own text

A faithful JavaScript port of the detector's six signals, running entirely in your browser. Paste text or load a preset.


Results · real data

Detection on real human-vs-LLM text

The headline — generalizing to unseen generators

Leave-one-model-out AUC on the held-out generator (reference seed 1234).
Held-out	Classifier	Perplexity	Zero-shot
Mistral-chat	0.981	0.965	0.784
ChatGPT	0.976	0.968	0.848
GPT-4	0.907	0.868	0.734
Cohere	0.837	0.862	0.636
GPT-2	0.834	0.840	0.594
Mean	0.924	0.911	0.732

Significant across 10 seeds

Classifier mean AUC 0.915 ± 0.006 (95% CI [0.911, 0.920]) vs perplexity 0.903 ± 0.007, with non-overlapping CIs. Paired permutation test: mean difference +0.013, classifier wins on 10/10 seeds, p = 0.002.
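One common form of paired permutation test is the exact sign-flip test: under the null hypothesis the sign of each per-seed difference is arbitrary, so every sign pattern is enumerated and the observed mean difference is compared against them. The sketch below assumes that form (the project's exact procedure may differ), and the per-seed values are made-up placeholders, not the project's results.

```python
from itertools import product


def paired_permutation_test(a, b):
    """Exact two-sided sign-flip test on paired samples.

    Returns (mean difference, p-value).
    """
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    observed = sum(diffs) / n
    extreme = 0
    total = 0
    # Enumerate all 2**n assignments of +/- signs to the paired differences.
    for signs in product((1, -1), repeat=n):
        stat = sum(s * d for s, d in zip(signs, diffs)) / n
        total += 1
        # Small tolerance guards against floating-point ties.
        if abs(stat) >= abs(observed) - 1e-12:
            extreme += 1
    return observed, extreme / total


# Placeholder AUCs for ten seeds (illustrative only): method A wins every seed.
toy_a = [0.51, 0.52, 0.53, 0.54, 0.55, 0.56, 0.57, 0.58, 0.59, 0.60]
toy_b = [0.50] * 10

mean_diff, p_value = paired_permutation_test(toy_a, toy_b)
print(p_value)  # 0.001953125, i.e. 2 / 1024
```

With ten pairs there are 2^10 = 1024 sign patterns, so when one method wins on every seed the smallest two-sided p-value this test can produce is 2/1024 ≈ 0.002.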

Leave-one-model-out AUC: the trained classifier generalizes to unseen generators, beating perplexity alone and the zero-shot heuristic.
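The leave-one-model-out protocol and the AUC metric can be sketched as follows. The record layout (a `generator` key that is `None` for human text) and the half-and-half split of human documents between train and test are assumptions made for this illustration; the project's `lomo.py` may organize its data differently.

```python
def auc(pos_scores, neg_scores):
    """Probability that a random positive outscores a random negative (ties count half)."""
    wins = 0.0
    for p in pos_scores:
        for q in neg_scores:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


def leave_one_model_out(records):
    """Yield (held_out, train, test) with one generator excluded from training each time."""
    generators = sorted({r["generator"] for r in records if r["generator"] is not None})
    humans = [r for r in records if r["generator"] is None]
    half = len(humans) // 2  # assumed split of human documents
    for held_out in generators:
        train = humans[:half] + [
            r for r in records if r["generator"] not in (None, held_out)
        ]
        test = humans[half:] + [r for r in records if r["generator"] == held_out]
        yield held_out, train, test


# Toy records with hypothetical generator names.
records = (
    [{"generator": None, "text": f"human {i}"} for i in range(4)]
    + [{"generator": "model_a", "text": f"a {i}"} for i in range(2)]
    + [{"generator": "model_b", "text": f"b {i}"} for i in range(2)]
)

for held_out, train, test in leave_one_model_out(records):
    print(held_out, len(train), len(test))
```

A classifier would be fitted on `train` and scored on `test`, then `auc` applied to the scores of the held-out generator's documents versus the held-back human documents.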

An honest comparison — GLTR beats the heuristics

Cross-model AUC: one fixed zero-shot detector vs eight generators, strong on chat models and near chance on GPT-2 base completions.

Zero-shot AUC. DetectGPT-curvature & GPTZero aren't run (white-box / paid API).
Detector (zero-shot)	HC3	RAID
SyntheticTextProbe (heuristics)	0.915	0.724
GLTR — GPT-2 log-prob	0.995	0.862
GLTR — GPT-2 top-10 rank	0.995	0.848

This strengthens the thesis

A white-box likelihood detector clearly wins for detection — but it is also per-document, so it is equally blind to collapse. The gap isn't an artifact of a weak detector.

Model collapse

Watching a model eat its own output

We retrain a model on the text it just generated, generation after generation, and instrument it with the same detector and diversity metrics.

Neural collapse curves (distilGPT-2). Left: diversity falls and the detector score creeps up. Right: reference perplexity crashes 69 → 7, the cleanest collapse signal in the project.
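The retrain-on-own-output loop can be illustrated with a deliberately tiny stand-in. The sketch below replaces the neural model with a unigram frequency table (an assumption made purely for illustration, not the distilGPT-2 fine-tuning used in the project): each generation "trains" by counting tokens and "generates" by sampling a new corpus of the same size, and vocabulary size is tracked along the way.

```python
import random
from collections import Counter


def recursive_unigram_collapse(tokens, generations, seed=1234):
    """Refit a unigram model on its own samples; return vocabulary size per generation."""
    rng = random.Random(seed)
    current = list(tokens)
    vocab_sizes = [len(set(current))]
    for _ in range(generations):
        counts = Counter(current)            # "train": estimate token frequencies
        words = list(counts)
        weights = [counts[w] for w in words]
        # "generate": sample a same-sized corpus from the fitted model
        current = rng.choices(words, weights=weights, k=len(current))
        vocab_sizes.append(len(set(current)))
    return vocab_sizes


# Toy corpus: one frequent token plus 200 rare tokens that each appear once.
toy_tokens = ["the"] * 200 + [f"w{i}" for i in range(200)]
sizes = recursive_unigram_collapse(toy_tokens, generations=10)
print(sizes)
```

Each generation can only sample tokens that survived the previous one, so the vocabulary never grows and rare tokens are the first to disappear. That is the same qualitative tail loss the diversity curves track, at a far smaller scale.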

Holds at 100× scale

Scaled the corpus to 150 / 1,500 / 15,000 docs on two datasets. Larger corpora collapse more slowly (vocab loss −96% → −32%), but in all six runs the distribution metrics still fall and the detector stays flat — contamination 0% everywhere.

Holds for a modern LLM

Not just 2019-era distilGPT-2: Qwen2.5-0.5B (2024) LoRA-fine-tuned recursively — reference perplexity 37 → 2.9, vocab −87% — yet the detector stays flat (17.6 → 18.4), contamination 2%.

Honest findings

What's true — including where the prediction was wrong

✓ Cheap features generalize

A 7-feature classifier reaches 0.915 on generators it never trained on, beating heuristics and perplexity alone.

~ Zero-shot transfer is uneven

One fixed detector ranges 0.57–0.83: strong on chat models, near chance on base GPT-2.

✗ GPT-2 perplexity didn't win

Predicted it would raise the floor; it didn't (0.895 vs 0.924). Reported with its caveat, not spun away.

✗ The detector misses collapse

Collapse is real (perplexity 69→7) but per-document detection barely moves — a different problem entirely.

Figures

Gallery

Collapse at scale: diversity falls at every size; detector stays flat.

Paper & reproducibility

Read it, run it

The paper

A 17-page write-up: detector, baselines, LOMO classifier, ablation, GLTR comparison, scaling, collapse, and a candid threats-to-validity section.

Download PAPER.pdf

Reproducible by design

Core runs on the Python standard library — zero install.
Real corpora fetched over HTTP (HC3, RAID).
Fixed seed (1234); one isolated venv only for GPT-2 / neural collapse.

detector.py · lomo.py · crossmodel.py · collapse_modern.py · gltr_baseline.py