Authorship Attribution: Principles and Methods

Authorship attribution is the forensic effort to identify who wrote a text by measuring idiosyncratic linguistic habits. This topic covers the theoretical foundation of the idiolect, the features analysts extract, and how closed-set and open-set problems differ.

Last updated: 19 Jun 2026

Authorship attribution is the forensic discipline that determines who wrote a questioned document by measuring idiosyncratic linguistic habits in that text against reference writing from known authors. The theoretical foundation is the idiolect: every person's language is shaped by a unique accumulation of geography, education, profession, and reading history, and that accumulation produces statistical regularities stable enough to distinguish one writer from another. Attribution analyses most reliably exploit function words and character n-grams, which operate below the writer's conscious control and persist across register shifts. The discipline distinguishes two core problem types: closed-set attribution, where the true author is assumed to be one of a defined candidate pool, and open-set attribution, where the true author may not appear in the pool at all.

Every text carries a statistical signature built from thousands of linguistic choices the writer made without conscious deliberation. Authorship attribution is the forensic discipline that reads this signature to answer a practical question: who wrote this? The question arises in threatening-letter investigations, will-contest litigation, plagiarism hearings, intelligence analysis, and literary scholarship, and the methods brought to each setting share a common theoretical backbone.

The backbone is the idiolect assumption: the claim that every person's language is shaped by a unique accumulation of influences (geography, schooling, profession, reading habits, age, and social network), and that this accumulation produces measurable regularities stable enough to distinguish one writer from another. The assumption is not romantic speculation. It is a testable hypothesis, and decades of research have established which features make it reliable and under what conditions it breaks down.

This topic builds the conceptual foundation. It defines what authorship attribution is trying to do, maps the two main problem types (closed-set and open-set), explains why function words and n-gram features carry the signal, and walks through the feature extraction workflow that turns raw text into a form a statistical classifier can use. The goal is to understand not just the techniques but the logical constraints they operate under, because those constraints are exactly what courtroom challenges probe.

By the end of this topic you will be able to:

Distinguish closed-set from open-set attribution and explain why the distinction matters for interpreting reported accuracy figures in court.
Explain the idiolect assumption and identify the linguistic features that make it empirically testable rather than speculative.
Describe why function words and character n-grams carry stronger attribution signal than content vocabulary.
Trace the feature extraction pipeline from raw text to classifier-ready vector, identifying the documented decisions required at each step.
Articulate the limits of attribution evidence, including the effect of text length, candidate pool size, and deliberate style disguise on error rates.

Key terms

Idiolect: An individual's unique variety of language, shaped by geography, education, profession, and personal history. The idiolect is the theoretical object authorship attribution is trying to measure.
Closed-set attribution: An attribution task where the true author is assumed to be one of a defined list of candidates. The system ranks candidates; it does not need to handle the possibility that none of them is the author.
Open-set attribution: An attribution task where the true author may or may not appear in the candidate pool. The system must both rank candidates and decide whether any candidate is a credible match, which requires a rejection threshold.
Function words: Grammatically obligatory words such as articles, prepositions, conjunctions, and pronouns. Used at high frequency and largely unconsciously, they are among the strongest attribution features available.
N-gram: A contiguous sequence of n items (characters, words, or part-of-speech tags) extracted from text. Character n-grams and word n-grams are both standard attribution features.
Feature extraction: The process of converting raw text into a numerical vector of linguistic measurements. The choice of features determines what signal the classifier can see, and is as consequential as the classifier itself.

The authorship problem defined

At its core, authorship attribution is an inference problem. You have a text of unknown or disputed origin, the questioned document, and you have some reference writing from one or more candidate authors. The question is whether the statistical properties of the questioned text match any candidate well enough to make an attribution claim. The qualifier 'well enough' is where almost all the methodological difficulty lives.

The problem comes in two distinct forms. Closed-set attribution assumes the true author is in the candidate pool and asks only which candidate is the best match. This is the easier version, but it carries a hidden danger: if the actual author is not in the pool, the method will still pick the closest candidate and may return a wrong result with spurious confidence. Open-set attribution adds a rejection option, but setting the rejection threshold correctly requires knowing the typical distance between unrelated writers, which requires a large and well-characterised population corpus.
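
The difference between the two forms can be made concrete with a minimal Python sketch. The distances and the threshold value are made-up illustrative numbers, and both function names are hypothetical; in practice the threshold would have to be calibrated on a population corpus rather than chosen by hand.

```python
def closed_set_attribution(distances):
    """Return the candidate with the smallest distance; always names someone."""
    return min(distances, key=distances.get)


def open_set_attribution(distances, threshold):
    """Return the closest candidate, or None if even the best match
    lies farther away than the rejection threshold."""
    best = min(distances, key=distances.get)
    return best if distances[best] <= threshold else None


# Hypothetical distances from a questioned text to each candidate's profile.
distances = {"A": 1.42, "B": 1.31, "C": 1.58}

closed_result = closed_set_attribution(distances)              # picks the nearest
open_result = open_set_attribution(distances, threshold=1.0)   # may reject all
print(closed_result, open_result)
```

With these numbers the closed-set function still names the nearest candidate even though no candidate is close, which is exactly the hidden danger described above; the open-set function declines to attribute.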

The idiolect assumption

The idiolect is not a choice. It is the sediment of every linguistic encounter a person has had: the dialect of the household they grew up in, the vocabulary of their profession, the syntactic patterns their teachers reinforced, the books they read, the communities they belong to. These influences layer over time and produce a stable statistical signature. The operative word is stable. Individuals shift register considerably, from an informal email to a legal brief, but the distribution of function words, the preference for certain phrase constructions, and the habitual spelling variants persist across registers in ways that content vocabulary does not.

This is the theoretical reason function words dominate attribution research. A writer consciously controls topic vocabulary. They do not consciously control whether they prefer 'which' or 'that' in relative clauses, or whether they tend toward sentence-opening conjunctions, or whether they systematically insert or omit the Oxford comma. These choices happen below the level of deliberate composition and are therefore harder to fake, harder to vary deliberately, and more likely to remain consistent across different texts from the same author.

Figure: Authorship signal by word type. Function words carry stronger, more stable signal than content words.
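
As a rough illustration of how function-word habits become measurable, the sketch below computes the relative frequency of a handful of function words in a text. The nine-word list and the simple regular-expression tokeniser are assumptions made for brevity (and are English-only); real studies use curated lists of roughly 70 to 500 words.

```python
import re
from collections import Counter

# A tiny illustrative list; not a validated function-word inventory.
FUNCTION_WORDS = ["the", "of", "and", "to", "a", "in", "that", "which", "but"]


def function_word_profile(text, function_words=FUNCTION_WORDS):
    """Relative frequency of each function word, per token of text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid division by zero on empty input
    return {w: counts[w] / total for w in function_words}


sample = "The report that the committee wrote, which was long, went to the board."
profile = function_word_profile(sample)
for word, rate in profile.items():
    print(f"{word:6s} {rate:.3f}")
```

Comparing such profiles across several texts by the same writer, and across writers, is what turns the 'which' versus 'that' intuition into a testable measurement.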

Features for attribution: n-grams and beyond

Feature extraction converts a text into a vector of measurements. The choice of features is arguably the most consequential decision in the entire workflow, because the classifier can only exploit signal that the features encode. Researchers have tested hundreds of feature types; the ones that consistently perform across languages and text types are function words, character n-grams, word n-grams, and part-of-speech (POS) n-grams.

Feature type	What it captures	Strengths	Weaknesses
Function-word frequencies	Relative use rate of ~70-500 grammatical words	Stable across topics, hard to forge deliberately	Requires sufficient text length; language-specific lists
Character n-grams	Sub-word sequences including punctuation and spaces	Captures spelling and morphological habits; topic-invariant	Large feature vector; noise from proper nouns
Word n-grams	Sequences of consecutive word tokens	Captures phrase and collocation preferences	Sensitive to topic; large vocabulary sparsity at n>2
POS n-grams	Sequences of grammatical categories	Captures syntactic preference without vocabulary bias	Requires reliable POS tagger; cross-language tagging inconsistency

Character n-grams are especially useful because they perform comparatively well even with short texts and across languages. A character trigram like '_th' (underscore for space) counts words beginning with 'th', a total dominated by function words such as 'the', 'that', and 'this'; 'ing' captures verb-form preferences; the word-final 'ly_' tracks adverb density. None of these requires the analyst to decide which words are 'function words' in a given language, which makes the approach portable.
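
A minimal sketch of character n-gram extraction, using the underscore-for-space convention from the paragraph above. The function name and the choice to pad both ends of the text with a boundary marker are assumptions of this example, not a fixed standard.

```python
from collections import Counter


def char_ngram_profile(text, n=3):
    """Relative frequencies of character n-grams, with spaces shown as '_'."""
    # Collapse runs of whitespace and pad both ends so word boundaries count.
    normalised = "_" + "_".join(text.lower().split()) + "_"
    grams = [normalised[i:i + n] for i in range(len(normalised) - n + 1)]
    counts = Counter(grams)
    total = len(grams) or 1  # avoid division by zero on very short input
    return {gram: count / total for gram, count in counts.items()}


profile = char_ngram_profile("The thing that they said")
top = sorted(profile.items(), key=lambda item: item[1], reverse=True)[:5]
print(top)
```

Punctuation is left in place on purpose, since punctuation habits are part of what character n-grams are meant to capture.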

Feature extraction workflow

Text acquisition and normalisation: Collect the questioned document and the reference corpus. Normalise encoding, handle OCR errors, and decide how to treat metadata (headers, timestamps). Document every decision, because different normalisation choices can shift results.
Tokenisation: Split text into words or characters as required by the chosen features. For word-level features, decide whether to preserve punctuation as separate tokens, how to handle contractions, and whether case folding is appropriate.
Feature computation: Count occurrences of each feature type across the text. Normalise by text length to produce relative frequencies rather than raw counts, so short and long texts are comparable.
Dimensionality and feature selection: Many feature sets produce thousands of dimensions. Reduce via most-frequent-word selection, principal component analysis, or statistical significance tests. Retaining too many features risks over-fitting to the training data.
Distance or classification: Compute the distance between the questioned document's vector and each candidate's vector (the classic distance-based stylometric approach), or train a classifier on the candidate corpora and predict the questioned document's class. The method choice affects interpretability in court.

Figure: Feature extraction pipeline from raw text to attribution decision.
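
The first four stages of the workflow can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a forensic-grade implementation: the tokeniser is English-only and case-folds everything, feature selection is plain most-frequent-word selection, and all function names are hypothetical. The final distance or classification stage is left out.

```python
import re
import unicodedata
from collections import Counter


def normalise(text):
    """Normalisation: unify Unicode forms and collapse whitespace."""
    text = unicodedata.normalize("NFKC", text)
    return re.sub(r"\s+", " ", text).strip()


def tokenise(text):
    """Tokenisation: case-fold and keep word tokens (apostrophes retained)."""
    return re.findall(r"[a-z']+", text.lower())


def relative_frequencies(tokens):
    """Feature computation: counts divided by text length."""
    counts = Counter(tokens)
    total = len(tokens) or 1
    return {word: count / total for word, count in counts.items()}


def select_features(token_lists, k):
    """Feature selection: the k most frequent words across the whole corpus."""
    pooled = Counter()
    for tokens in token_lists:
        pooled.update(tokens)
    return [word for word, _ in pooled.most_common(k)]


def vectorise(texts, k=50):
    """Run the four stages and return (feature list, one vector per text)."""
    token_lists = [tokenise(normalise(t)) for t in texts]
    features = select_features(token_lists, k)
    vectors = []
    for tokens in token_lists:
        freqs = relative_frequencies(tokens)
        vectors.append([freqs.get(f, 0.0) for f in features])
    return features, vectors


texts = ["The cat sat on the mat.", "A dog and the cat ran to the park."]
features, vectors = vectorise(texts, k=5)
print(features)
print(vectors)
```

Each function corresponds to one documented decision point, which makes it easier to record exactly which choices were made and to rerun the analysis with different ones.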

Limits and the evidence question

Authorship attribution produces a probabilistic claim, not an identity proof. Even the best-validated methods operate with error rates that increase as text length falls, as the number of candidates rises, and as the gap between the questioned document and the reference corpus widens in time or register. These limits are not failures of the method; they are its honest operating envelope.

Courts in many countries have been sceptical of attribution evidence where the analyst could not state a reliable error rate, where the comparison corpus was small or poorly matched, or where the method had not been independently validated. The most defensible reports state clearly: the questioned text is more consistent with candidate A than with the other candidates tested, with a likelihood ratio estimated from cross-validation on a population corpus of N texts. They do not claim certainty.
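
One way to arrive at a stated error rate is leave-one-out cross-validation over blocks of known authorship. The sketch below is a simplified illustration: it uses nearest-centroid attribution with Manhattan distance, assumes every author has at least two blocks, and runs on made-up two-feature vectors. It estimates a closed-set error rate only, not a likelihood ratio, and the function names are hypothetical.

```python
def centroid(vectors):
    """Feature-wise mean of a list of equal-length vectors."""
    return [sum(column) / len(column) for column in zip(*vectors)]


def manhattan(u, v):
    """Sum of absolute feature differences."""
    return sum(abs(a - b) for a, b in zip(u, v))


def leave_one_out_error(blocks_by_author):
    """Fraction of held-out blocks attributed to the wrong author."""
    errors = trials = 0
    for author, blocks in blocks_by_author.items():
        for i, held_out in enumerate(blocks):
            centroids = {}
            for other, other_blocks in blocks_by_author.items():
                # Exclude the held-out block from its own author's centroid.
                kept = [b for j, b in enumerate(other_blocks)
                        if not (other == author and j == i)]
                centroids[other] = centroid(kept)
            guess = min(centroids,
                        key=lambda name: manhattan(held_out, centroids[name]))
            errors += guess != author
            trials += 1
    return errors / trials


# Made-up feature vectors; the last block of B deliberately resembles A.
blocks_by_author = {
    "A": [[0.060, 0.030], [0.061, 0.029], [0.059, 0.031]],
    "B": [[0.050, 0.020], [0.051, 0.021], [0.049, 0.019], [0.060, 0.030]],
}
error = leave_one_out_error(blocks_by_author)
print(f"estimated closed-set error rate: {error:.2f}")
```

The blocks used for validation should match the questioned document in length and register, because an error rate measured on long, well-matched texts says little about a short, mismatched one.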

Worked example

Authorship of a threatening workplace email

Closed-set attribution with function-word frequencies and character trigrams.

A company receives a threatening email sent from an anonymous free-mail account. The security team has identified four employees who had both the access and the grievance to send it. Each of the four has authored substantial internal email correspondence over the preceding year, giving the analyst reference corpora of 3,000-8,000 words per candidate.

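The questioned email is 320 words: below the roughly 500 words at which accuracy typically starts to fall, but still workable provided the error rate at this length is measured and reported. The analyst extracts a feature vector using the 150 most frequent function words and character trigrams from the combined candidate corpus.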
Each candidate's reference corpus is segmented into 320-word blocks to match the questioned document length. Feature vectors are computed for every block, giving a distribution of vectors per candidate.
Burrows's Delta, a distance computed from standardised feature frequencies, is calculated between the questioned vector and each candidate's centroid. Candidate B shows the smallest Delta, by a margin large enough that its distance distribution barely overlaps those of the other three candidates.
Cross-validation on the reference corpora gives an error rate of 12 per cent at this text length for this candidate set. The analyst reports: the questioned email is most consistent with Candidate B's writing style. The result does not exclude the possibility of another author outside the four-person set, and the 12 per cent error rate applies to the closed-set result.

What is not claimed is certainty that B wrote the email, or that the other three are definitively excluded. The result is one input to the investigation, to be weighed against access logs, device data, and behavioural evidence. Forensic linguistic reports should maintain this proportionality.
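
The block segmentation and Delta computation in the worked example can be sketched as follows. The feature values are invented for illustration and are not the case data. As a simplification, the means and standard deviations used for standardisation are taken across the candidate centroids (with the population standard deviation), whereas published uses of Delta usually standardise across all texts in the reference corpus.

```python
from statistics import mean, pstdev


def segment(tokens, size):
    """Cut a token list into consecutive full-length blocks of `size` tokens."""
    return [tokens[i:i + size] for i in range(0, len(tokens) - size + 1, size)]


def centroid(vectors):
    """Feature-wise mean of a list of equal-length vectors."""
    return [mean(column) for column in zip(*vectors)]


def burrows_delta(questioned, profiles):
    """Mean absolute difference of z-scores between the questioned vector
    and each candidate profile (dict of name -> feature vector)."""
    names = list(profiles)
    n_features = len(questioned)
    mus, sigmas = [], []
    for j in range(n_features):
        column = [profiles[name][j] for name in names]
        mus.append(mean(column))
        sigmas.append(pstdev(column) or 1.0)  # guard against zero spread

    def z(vector):
        return [(vector[j] - mus[j]) / sigmas[j] for j in range(n_features)]

    zq = z(questioned)
    return {
        name: mean(abs(a - b) for a, b in zip(zq, z(profiles[name])))
        for name in names
    }


# Invented relative frequencies for three features and four candidates.
profiles = {
    "A": [0.060, 0.030, 0.010],
    "B": [0.050, 0.020, 0.020],
    "C": [0.040, 0.035, 0.015],
    "D": [0.055, 0.025, 0.012],
}
questioned = [0.051, 0.021, 0.019]

deltas = burrows_delta(questioned, profiles)
best = min(deltas, key=deltas.get)
print(best, {name: round(d, 3) for name, d in deltas.items()})
```

In a real analysis each profile would be the centroid of that candidate's 320-word blocks, produced with `segment` and `centroid`, and the spread of block-level Deltas would be inspected alongside the centroid result.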

Key Takeaways

Authorship attribution rests on the idiolect assumption: every person's language habits are unique and stable enough to leave a measurable signature in their writing.
Closed-set attribution ranks known candidates; open-set attribution must also decide whether any candidate is a plausible match, a harder problem with higher stakes for false attribution.
Function words are the dominant attribution feature because they are high-frequency, syntactically obligatory, and selected below the level of conscious stylistic control.
Character n-grams capture spelling, morphological, and punctuation habits without requiring language-specific word lists, making them portable across languages and genres.
Feature extraction converts text to comparable numerical vectors; every step (tokenisation, normalisation, feature selection) requires documented choices that must be reproducible.
Reliable attribution requires sufficient text length (typically 500+ words) and a well-matched reference corpus; forensic reports should state the error rate from cross-validation rather than claiming certainty.

What is the idiolect assumption in authorship attribution?

The idiolect assumption holds that every person's language use is shaped by a unique mix of geography, education, age, profession, and reading history. That combination is stable enough that it shows up consistently in writing, especially in high-frequency grammatical choices that operate below conscious awareness.

Why are function words better attribution markers than content words?

Function words such as 'the', 'of', and 'which' are syntactically obligatory, used at very high frequencies, and chosen largely outside the writer's conscious control. A forger can change vocabulary, but sustaining someone else's natural function-word rhythm across thousands of words is very difficult, although deliberate style disguise does raise error rates.

What is the difference between closed-set and open-set attribution?

In closed-set attribution the true author is known to be one of a fixed list of candidates, so the task is ranking. In open-set attribution the true author may not be in the candidate pool at all, so the system must also decide whether any candidate is a plausible match, a much harder problem with a higher false-positive risk.

What are character n-grams and why do they work for attribution?

Character n-grams are sequences of n consecutive characters, including spaces and punctuation. They capture spelling preferences, hyphenation habits, and morphological patterns without requiring any linguistic pre-processing. Because they operate at a sub-word level they hold up well across topics, which makes them useful across short and topically diverse texts.

How much text is needed for reliable attribution?

There is no hard minimum, but most published benchmarks show accuracy dropping sharply below about 500 words of questioned text. The reference corpus from known writings also needs to be large enough and topically varied enough to capture the author's range rather than a single style snapshot.
