Authorship attribution is the forensic effort to identify who wrote a text by measuring idiosyncratic linguistic habits. This topic covers the theoretical foundation of the idiolect, the features analysts extract, and how closed-set and open-set problems differ.
Every text carries a fingerprint: not a literal one, but a statistical signature built from thousands of tiny linguistic choices the writer made without thinking about them. Authorship attribution is the forensic discipline that reads this signature and uses it to answer a deceptively simple question: who wrote this? The question arises in threatening-letter investigations, will-contest litigation, plagiarism hearings, intelligence analysis, and literary scholarship, and the methods brought to each setting share a common theoretical backbone.
The backbone is the idiolect assumption: the claim that every person's language is shaped by a unique accumulation of influences (geography, schooling, profession, reading habits, age, and social network), and that this accumulation produces measurable regularities stable enough to distinguish one writer from another. The assumption is not romantic speculation. It is a testable hypothesis, and decades of research have established which features make it reliable and under what conditions it breaks down.
This topic builds the conceptual foundation. It defines what authorship attribution is trying to do, maps the two main problem types (closed-set and open-set), explains why function words and n-gram features carry the signal, and walks through the feature extraction workflow that turns raw text into a form a statistical classifier can use. The goal is to understand not just the techniques but the logical constraints they operate under, because those constraints are exactly what courtroom challenges probe.
A questioned document and an unknown author: what exactly is being asked?
At its core, authorship attribution is an inference problem. You have a text of unknown or disputed origin, the questioned document, and you have some reference writing from one or more candidate authors. The question is whether the statistical properties of the questioned text match any candidate well enough to make an attribution claim. The qualifier 'well enough' is where almost all the methodological difficulty lives.
The problem comes in two quite different flavours. Closed-set attribution assumes the true author is in the candidate pool and asks only which candidate is the best match. This is the easier version, but it carries a hidden danger: if the actual author is not in the pool, the method will still pick the closest candidate and may return a wrong result with spurious confidence. Open-set attribution drops that assumption and adds a rejection option, so the analysis can conclude that none of the candidates wrote the text. Setting the rejection threshold correctly, however, requires knowing the typical distance between unrelated writers, which in turn requires a large and well-characterised population corpus.
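The difference between the two problem types can be made concrete with a minimal sketch. It assumes that some distance between the questioned document and each candidate's reference writing has already been computed; the function names, the candidate labels, the distance values, and the threshold are all hypothetical and chosen only for illustration.

```python
from typing import Dict, Optional


def closed_set_attribution(distances: Dict[str, float]) -> str:
    """Return the nearest candidate, however poor the match actually is."""
    return min(distances, key=distances.get)


def open_set_attribution(distances: Dict[str, float],
                         threshold: float) -> Optional[str]:
    """Return the nearest candidate, or None if even that one is too far away."""
    best = closed_set_attribution(distances)
    return best if distances[best] <= threshold else None


# Invented distances (smaller = more similar). All three candidates are
# roughly equally far from the questioned document.
distances = {"candidate_A": 1.42, "candidate_B": 1.47, "candidate_C": 1.51}

closed_result = closed_set_attribution(distances)            # always names someone
open_result = open_set_attribution(distances, threshold=1.0)  # may reject everyone
```

With these invented numbers the closed-set function still names candidate_A, while the open-set function returns None because no candidate falls within the threshold. In practice the threshold would not be picked by hand as it is here; it would have to be estimated from the distances observed between unrelated writers in a population corpus.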
Why everyone writes differently, even when writing about the same thing.
The idiolect is not a choice. It is the sediment of every linguistic encounter a person has had: the dialect of the household they grew up in, the vocabulary of their profession, the syntactic patterns their teachers reinforced, the books they read, the communities they belong to. These influences layer over time and produce a stable statistical signature. The operative word is stable. Individuals shift register considerably, from an informal email to a legal brief, but the distribution of function words, the preference for certain phrase constructions, and the habitual spelling variants persist across registers in ways that content vocabulary does not.
This is the theoretical reason function words dominate attribution research. A writer consciously controls topic vocabulary. They do not consciously control whether they prefer 'which' or 'that' in relative clauses, or whether they tend toward sentence-opening conjunctions, or whether they systematically insert or omit the Oxford comma. These choices happen below the level of deliberate composition and are therefore harder to fake, harder to vary deliberately, and more likely to remain consistent across different texts from the same author.
Turning prose into numbers without losing the linguistic signal.
Feature extraction converts a text into a vector of measurements. The choice of features is arguably the most consequential decision in the entire workflow, because the classifier can only exploit signal that the features encode. Researchers have tested hundreds of feature types; the ones that consistently perform across languages and text types are function words, character n-grams, word n-grams, and part-of-speech (POS) n-grams.
| Feature type | What it captures | Strengths | Weaknesses |
|---|---|---|---|
| Function-word frequencies | Relative use rate of ~200-500 grammatical words | Stable across topics, hard to forge deliberately | Requires sufficient text length; language-specific lists |
| Character n-grams | Sub-word sequences including punctuation and spaces | Captures spelling and morphological habits; less topic-dependent than word n-grams | Large feature vector; noise from proper nouns |
| Word n-grams | Sequences of consecutive word tokens | Captures phrase and collocation preferences | Sensitive to topic; large vocabulary sparsity at n>2 |
| POS n-grams | Sequences of grammatical categories | Captures syntactic preference without vocabulary bias | Requires reliable POS tagger; cross-language tagging inconsistency |
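As an illustration of the first row, the sketch below turns a text into a vector of relative function-word frequencies. It is a minimal example under stated assumptions: the eight-word list is a deliberately tiny stand-in for the few hundred words a real study would use, the regular-expression tokeniser is a simplification for English, and the function name is hypothetical.

```python
import re
from collections import Counter
from typing import List

# Illustrative only: a real analysis would use a list of a few hundred words.
FUNCTION_WORDS = ["the", "of", "and", "to", "in", "that", "which", "but"]


def function_word_vector(text: str,
                         vocabulary: List[str] = FUNCTION_WORDS) -> List[float]:
    """Relative frequency of each vocabulary word among all tokens in the text."""
    # Crude tokeniser: lowercase runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return [0.0] * len(vocabulary)
    counts = Counter(tokens)
    return [counts[word] / len(tokens) for word in vocabulary]


sample = "The cat sat on the mat, and the dog slept."
vector = function_word_vector(sample)
# 10 tokens in total: "the" occurs 3 times, "and" once, the rest not at all.
```

Dividing by the token count is what makes vectors from texts of different lengths comparable, although very short texts still give unstable estimates, which is the weakness the table notes.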
Character n-grams deserve particular attention because they hold up comparatively well on short texts and carry over across languages. A character trigram like '_th' (underscore for space) picks up words that begin with 'th', such as 'the', 'that', and 'this'; 'ing' largely reflects -ing verb forms; 'ly_' tracks word-final '-ly' and so gives a rough measure of adverb density. None of these requires the analyst to decide which words are 'function words' in a given language, which makes the approach portable.
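A minimal sketch of character n-gram counting, using the same underscore-for-space convention, might look like the following. The function name and the normalisation choices (lowercasing, collapsing whitespace, padding both ends with a boundary marker) are assumptions made for this example, not a prescribed standard.

```python
from collections import Counter


def char_ngrams(text: str, n: int = 3) -> Counter:
    """Count overlapping character n-grams, writing spaces as underscores."""
    # Lowercase, collapse whitespace runs, and mark word boundaries with "_".
    normalised = "_".join(text.lower().split())
    # Pad both ends so text-initial and text-final words also get boundary n-grams.
    padded = f"_{normalised}_"
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


profile = char_ngrams("The thing is that they think.")
# profile["_th"] == 5  (the, thing, that, they, think)
# profile["ing"] == 1  (thing)
```

Note that 'ing' is counted here inside a noun, which shows why individual n-grams are only loose proxies for grammatical categories. Dividing each count by the total number of n-grams would give relative frequencies comparable across texts of different lengths.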
From raw text to a classifier-ready vector, step by step.
What attribution can and cannot claim in a legal setting.
Authorship attribution produces a probabilistic claim, not an identity proof. Even the best-validated methods operate with error rates that increase as text length falls, as the number of candidates rises, and as the gap between the questioned document and the reference corpus widens in time or register. These limits are not failures of the method; they are its honest operating envelope.
Courts in many countries have been sceptical of attribution evidence where the analyst could not state a reliable error rate, where the comparison corpus was small or poorly matched, or where the method had not been independently validated. The most defensible reports state clearly: the questioned text is more consistent with candidate A than with the other candidates tested, with a likelihood ratio estimated from cross-validation on a population corpus of N texts. They do not claim certainty.
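For readers unfamiliar with the term, a likelihood ratio in its general form compares how probable the observed evidence is under two competing hypotheses. The notation here is introduced only for illustration: E stands for the measured features of the questioned text, H_A for the hypothesis that candidate A wrote it, and H_alt for the alternative that someone else did.

```latex
\mathrm{LR} = \frac{P(E \mid H_A)}{P(E \mid H_{\mathrm{alt}})}
```

A value above 1 means the evidence is more probable if candidate A is the author, a value below 1 favours the alternative, and a value near 1 means the features do not discriminate. How the two probabilities are estimated depends on the method and the population corpus, which is why the report needs to state both.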